I am reading a string of multiple JSONs and converting it into multiple columns in a PySpark dataframe. The JSON elements may or may not have null values. My code works fine when all elements in the JSON are non-null; however, if a single element is null, it makes all the elements null. Here's an example.
Input:
Note that addresses looks like an array but it is actually a string.
id='1'
addresses='
[{
"city": "city1",
"state": null,
"street": null,
"postalCode": null,
"country": "country1"
}
,
{
"city": "city2",
"state": null,
"street": "street2",
"postalCode": "11111",
"country": "country2"
}]'
Expected output:
id city state street postalCode country
1 city1 null null null country1
1 city2 null street2 11111 country2
My code:
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import from_json, explode_outer, col, concat_ws
addl_addr_schema = ArrayType(StructType([
    StructField("addl_addr_city", StringType(), True),
    StructField("addl_addr_state", StringType(), True),
    StructField("addl_addr_street", StringType(), True),
    StructField("addl_addr_postalCode", StringType(), True),
    StructField("addl_addr_country", StringType(), True),
]))

dpDF_transformed = dpDF_temp.withColumn('addresses_transformed', from_json('addresses', addl_addr_schema)) \
    .withColumn('addl_addr', explode_outer('addresses_transformed'))

dpDF_transformed = dpDF_transformed.select("*",
    col("addresses_transformed.addl_addr_street").alias("addl_addr_street_array"),
    col("addresses_transformed.addl_addr_city").alias("addl_addr_city_array"),
    col("addresses_transformed.addl_addr_state").alias("addl_addr_state_array"),
    col("addresses_transformed.addl_addr_postalCode").alias("addl_addr_postalCode_array"),
    col("addresses_transformed.addl_addr_country").alias("addl_addr_country_array"))

dpDF_final = dpDF_transformed.withColumn("addl_addr_street", concat_ws(",", "addl_addr_street_array")) \
    .withColumn("addl_addr_city", concat_ws(",", "addl_addr_city_array")) \
    .withColumn("addl_addr_state", concat_ws(",", "addl_addr_state_array")) \
    .withColumn("addl_addr_postalCode", concat_ws(",", "addl_addr_postalCode_array")) \
    .withColumn("addl_addr_country", concat_ws(",", "addl_addr_country_array")) \
    .drop("addresses", "addresses_transformed", "addl_addr", "addl_addr_street_array", "addl_addr_city_array", "addl_addr_state_array", "addl_addr_postalCode_array", "addl_addr_country_array")
Output I am getting:
id city state street postalCode country
1 city1 null null null null
1 city2 null null null null
I believe what is happening is that from_json is seeing a type mismatch: I have defined every element as StringType(), but some elements are actually NullType. How do I deal with this? The attributes may or may not be null. I thought setting nullable = True while defining the schema would help, but it doesn't seem to.
You're on the right track: you've identified that you need a schema, that you need to explode the array, and that you need to extract the columns.
The problem is in your from_json call. The field names in your schema (e.g. addl_addr_state) and the field names in your data (e.g. state) do not match, so Spark finds nothing to parse and returns null for every field. You need to make sure that the fields in your schema and in your data have the same names.
You can do all of this in a simpler way, using some neat tricks. The following code will get you where you want to be:
from pyspark.sql.types import ArrayType, StringType, StructType, StructField
from pyspark.sql.functions import from_json, explode
id='1'
addresses="""
[{
"city": "city1",
"state": null,
"street": null,
"postalCode": null,
"country": "country1"
}
,
{
"city": "city2",
"state": null,
"street": "street2",
"postalCode": "11111",
"country": "country2"
}]"""
schema = ArrayType(StructType([
StructField("city", StringType(), True),
StructField("state", StringType(), True),
StructField("street", StringType(), True),
StructField("postalCode", StringType(), True),
StructField("country", StringType(), True),
]))
# Reading in the dataframe with the raw json string in the addresses column
df = spark.createDataFrame([(id, addresses)], ["id", "addresses"])
# Parsing in our json and exploding to have a single line per city
parsed_df = df.withColumn("addresses", explode(from_json("addresses", schema)))
parsed_df.show(truncate=False)
+---+----------------------------------+
|id |addresses |
+---+----------------------------------+
|1 |[city1,,,, country1] |
|1 |[city2,, street2, 11111, country2]|
+---+----------------------------------+
# Unwrapping the addresses column with the "struct.*" notation
unwrapped_df = parsed_df.select("id", "addresses.*")
unwrapped_df.show(truncate=False)
+---+-----+-----+-------+----------+--------+
|id |city |state|street |postalCode|country |
+---+-----+-----+-------+----------+--------+
|1 |city1|null |null |null |country1|
|1 |city2|null |street2|11111 |country2|
+---+-----+-----+-------+----------+--------+
So as you can see, properly reading in the data followed by a few small manipulations (from_json, explode, select("struct.*")) gives you a fairly easy way to work around your problem.
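If you would still like your original addl_addr_* column names, here is an optional follow-up sketch (plain col/alias renaming, nothing specific to your data) that renames the fields after unwrapping:
from pyspark.sql.functions import col

# Optional: put the addl_addr_ prefix back on the unwrapped fields
renamed_df = unwrapped_df.select(
    "id",
    *[col(c).alias("addl_addr_" + c) for c in ["city", "state", "street", "postalCode", "country"]]
)
renamed_df.show(truncate=False)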
Hope this helps!
Related
My RDD (from Elasticsearch) looks like this.
[
('rty456ui', {'#timestamp': '2022-10-10T24:56:10.000259+0000', 'host': {'id': 'test-host-id-1'}, 'watchlists': {'ioc': {'summary': '127.0.0.1', 'tags': ('Dummy Tag',)}}, 'source': {'ip': '127.0.0.1'}, 'event': {'created': '2022-10-10T13:56:10+00:00', 'id': 'rty456ui'}, 'tags': ('Mon',)}),
('cxs980qw', {'#timestamp': '2022-10-10T13:56:10.000259+0000', 'host': {'id': 'test-host-id-2'}, 'watchlists': {'ioc': {'summary': '0.0.0.1', 'tags': ('Dummy Tag',)}}, 'source': {'ip': '0.0.0.1'}, 'event': {'created': '2022-10-10T24:56:10+00:00', 'id': 'cxs980qw'}, 'tags': ('Mon', 'Tue')})
]
(What I find interesting is that lists in ES are converted to tuples in the RDD.)
I am trying to convert it into something like this.
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
|host.id |event.id |source.ip |event.created |watchlists.ioc.summary |watchlists.ioc.tags |tags |
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
|test-host-id-1 |rty456ui |127.0.0.1 |2022-10-10T13:56:10+00:00 |127.0.0.1 |[Dummy Tag] |[Mon] |
|test-host-id-2 |cxs980qw |0.0.0.1 |2022-10-10T24:56:10+00:00 |127.0.0.1 |[Dummy Tag] |[Mon, Tue] |
+---------------+-----------+-----------+---------------------------+-----------------------+-----------------------+---------------+
However, I am getting this.
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
|host.id|event.id|source.ip|event.created|watchlists.ioc.summary|watchlists.ioc.tags|tags |
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
|null   |null    |null     |null         |null                  |null               |[Ljava.lang.Object;@6c704e6e   |
|null   |null    |null     |null         |null                  |null               |[Ljava.lang.Object;@701ea4c8   |
+-------+--------+---------+-------------+----------------------+-------------------+-------------------------------+
Code
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField("host.id",StringType(), True),
StructField("event.id",StringType(), True),
StructField("source.ip",StringType(), True),
StructField("event.created", StringType(), True),
StructField("watchlists.ioc.summary", StringType(), True),
StructField("watchlists.ioc.tags", StringType(), True),
StructField("tags", StringType(), True)
])
df = spark.createDataFrame(es_rdd.map(lambda x: x[1]),schema)
df.show(truncate=False)
I'm trying to convert an RDD into a DataFrame, and I want to define the schema for it myself. However, spark.createDataFrame(rdd, schema) returns only null values, even though the RDD has data. Further, I get [Ljava.lang.Object;@701ea4c8 in the output too. So what am I missing here?
Your post covers two questions:
Why are all the columns null even though I declare a schema when transforming the RDD to a dataframe? In your schema you use dotted names of the form StructTypeColumn.StructFieldColumn (e.g. host.id), expecting them to reach into the nested values of the RDD. However, that kind of selection only works in a Spark SQL select statement, and no such parsing happens here. To achieve your goal, you have to update the lambda inside your map to extract the exact elements, for example:
rdd_trans = rdd.map(lambda x: (x[1]['host']['id'], x[1]['event']['id'], ...))
Why is the tags column not shown as expected? Because you declared the tags column as a string column; you should use ArrayType(StringType()) instead.
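As a rough sketch (the underscore-based column names are my own choice, to avoid the dotted-name lookup issue; adjust them as needed), the full mapping plus schema could look like this:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("host_id", StringType(), True),
    StructField("event_id", StringType(), True),
    StructField("source_ip", StringType(), True),
    StructField("event_created", StringType(), True),
    StructField("watchlists_ioc_summary", StringType(), True),
    StructField("watchlists_ioc_tags", ArrayType(StringType()), True),
    StructField("tags", ArrayType(StringType()), True),
])

rdd_trans = es_rdd.map(lambda x: (
    x[1]['host']['id'],
    x[1]['event']['id'],
    x[1]['source']['ip'],
    x[1]['event']['created'],
    x[1]['watchlists']['ioc']['summary'],
    list(x[1]['watchlists']['ioc']['tags']),  # the ES tuples become plain lists
    list(x[1]['tags']),
))

df = spark.createDataFrame(rdd_trans, schema)
df.show(truncate=False)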
I am trying to load a csv file into a Spark dataframe. The csv file doesn't have a header as such, but I know which field corresponds to what.
The problem is that my csv has almost 35 odd fields, but I am interested in only a limited set of columns. So is there a way I can load just the selected columns and map them to the corresponding fields defined in my schema?
Let's say we have following CSV:
1,Michel,1256,Student,high Street, New Delhi
2,Solace,7689,Artist,M G Road, Karnataka
In Scala my code is something like this:
val sample_schema = StructType(Array(
  StructField("Name", StringType, nullable = false),
  StructField("unique_number", StringType, nullable = false),
  StructField("state", StringType, nullable = false)))

val blogsDF = sparkSession.read.schema(sample_schema)
  .option("header", true)
  .csv(file_path)
This will load the data into a dataframe, but it will not be in the order I want.
What I want is for each csv record to be split and the data loaded as per the underlying mapping:
col1 --> Name
col2 --> unique id
col5 --> state
Not sure if we can do this kind of operation before loading the data into a DataFrame. I know another approach where we load the data into one dataframe and then select a few columns to create another dataframe; I just want to check if we can do the mapping during the data load itself.
Any help or pointer in this regard will be really helpful.
Thanks
Ashit
Have you tried this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, LongType

schema = StructType([StructField("a", IntegerType(), True),
                     StructField("b", IntegerType(), True),
                     StructField("c", StringType(), True),
                     StructField("d", StringType(), True),
                     StructField("e", DoubleType(), True),
                     StructField("f", LongType(), True),
                     ])

df = spark.read.csv('blablabla', schema=schema)
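If you only need a few of the positional columns, a hedged follow-up (reusing the placeholder a–f names from the schema above) is to read everything and then select and rename just the fields you care about:
# Keep only the positions you need and give them meaningful names
selected_df = (df
    .select("a", "b", "e")  # e.g. col1, col2 and col5 of the original file
    .withColumnRenamed("a", "Name")
    .withColumnRenamed("b", "unique_number")
    .withColumnRenamed("e", "state"))
selected_df.show()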
I have a dataframe that has a column that is a JSON string
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
sc = SparkSession.builder.getOrCreate()
l = [
(1, """{"key1": true, "nested_key": {"mylist": ["foo", "bar"], "mybool": true}})"""),
(2, """{"key1": true, "nested_key": {"mylist": "", "mybool": true}})"""),
]
df = sc.createDataFrame(l, ["id", "json_str"])
and want to parse the json_str column with from_json using a schema
schema = StructType([
StructField("key1", BooleanType(), False),
StructField("nested_key", StructType([
StructField("mylist", ArrayType(StringType()), False),
StructField("mybool", BooleanType(), False)
]))
])
df = df.withColumn("data", F.from_json(F.col("json_str"), schema))
df.show(truncate=False)
+---+--------------------------+
|id |data |
+---+--------------------------+
|1 |[true, [[foo, bar], true]]|
|2 |[true, [, true]] |
+---+--------------------------+
As one can see, the second row didn't conform to the schema in schema, so it's null even though I passed False to nullable in the StructField. It's important to my pipeline that if there's data that doesn't conform to the defined schema, an alert gets raised somehow, but I'm not sure about the best way to do this in PySpark. The real data has many, many keys, some of them nested, so checking each one with some form of isNan isn't feasible, and since we already defined the schema it feels like there should be a way to leverage that.
If it matters, I don't necessarily need to check the schema of the whole dataframe; I'm really only after checking the schema of the StructType column.
Check out the options parameter:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=from_json#pyspark.sql.functions.from_json
It's a little vague, but it allows you to pass a dict to the underlying method here:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html?highlight=from_json#pyspark.sql.DataFrameReader.json
You might have success passing something like options={'mode' : 'FAILFAST'}.
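For example, a minimal sketch (reusing df, json_str and schema from the question; the exact behavior of FAILFAST with from_json can vary between Spark versions, with older releases ignoring the mode):
import pyspark.sql.functions as F

# With mode=FAILFAST, from_json should raise an error on records that cannot
# be parsed against the schema instead of silently producing null.
df_strict = df.withColumn(
    "data",
    F.from_json(F.col("json_str"), schema, options={"mode": "FAILFAST"})
)
df_strict.show(truncate=False)  # forces the parsing, and the failure, if any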
I'm using databricks and trying to read in a csv file like this:
df = (spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(path_to_my_file)
)
and I'm getting the error:
AnalysisException: 'Unable to infer schema for CSV. It must be specified manually.;'
I've checked that my file is not empty, and I've also tried to specify schema myself like this:
schema = "datetime timestamp, id STRING, zone_id STRING, name INT, time INT, a INT"
df = (spark.read
.option("header", "true")
.schema(schema)
.csv(path_to_my_file)
)
But when I try to view it using display(df), df.show(), or df.printSchema(), it looks like the data is not being read into the dataframe at all (the output and error screenshots are not reproduced here). I'm totally lost and don't know what to do.
Note: this is an incomplete answer, as there isn't enough information about what your file looks like to understand why inferSchema did not work. I've posted this as an answer because it is too long for a comment.
That said, to specify a schema programmatically, you need to define it using StructType().
Using your example of
datetime timestamp, id STRING, zone_id STRING, name INT, time INT, mod_a INT
it would look something like this:
# Import data types
from pyspark.sql.types import *
schema = StructType(
[StructField('datetime', TimestampType(), True),
StructField('id', StringType(), True),
StructField('zone_id', StringType(), True),
StructField('name', IntegerType(), True),
StructField('time', IntegerType(), True),
StructField('mod_a', IntegerType(), True)
]
)
Note how your df.printSchema() showed that all of the columns were of datatype string.
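You would then pass this StructType to the reader in place of the DDL string, using the same read call as in your question:
df = (spark.read
      .option("header", "true")
      .schema(schema)
      .csv(path_to_my_file))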
I discovered that the problem was caused by the filename.
Perhaps Databricks is unable to read files whose names begin with an underscore ('_').
I had the same problem, and when I uploaded the file without the first character (i.e. the underscore), I was able to process it.
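If renaming the file is an option, a quick sketch for Databricks (the paths here are hypothetical) is to move it to a name without the leading underscore before reading:
# hypothetical paths; dbutils is available inside Databricks notebooks
dbutils.fs.mv("dbfs:/mnt/data/_my_file.csv", "dbfs:/mnt/data/my_file.csv")
df = spark.read.option("header", "true").option("inferSchema", "true").csv("dbfs:/mnt/data/my_file.csv")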
I am a little new to pyspark/bigdata so this could be a bad idea, but I have about a million individual CSV files each associated with some metadata. I would like a pyspark dataframe with columns for all the metadata fields, but also with a column whose entries are the (whole) CSV files associated with each set of metadata.
I am not at work right now, but I remember almost the exact code. I have tried a toy example, something like this:
import pandas as pd

outer_pandas_df = pd.DataFrame.from_dict({"A": [1, 2, 3], "B": [4, 5, 6]})
##    A  B
## 0  1  4
## 1  2  5
## 2  3  6
And then if you do
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True)
])
outer_spark_df = sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
Then the result is a spark dataframe as expected. But now if you do
inner_pandas_df = pd.DataFrame.from_dict({"W":["X","Y","Z"]})
outer_pandas_df["C"] = [inner_pandas_df, inner_pandas_df, inner_pandas_df]
And make the schema like
inner_schema = StructType([
StructField("W", StringType(), True)
])
outer_schema = StructType([
StructField("A", IntegerType(), True),
StructField("B", IntegerType(), True),
StructField("W", ArrayType(inner_schema), True)
])
then this fails:
sqlctx.createDataFrame(outer_pandas_df, schema=outer_schema)
with an error related to ArrayType not accepting pandas dataframes. I don't have the exact error.
Is what I'm trying to do possible?
Spark does not support nested dataframes. Why do you want a column that contains the entire CSV to be constantly stored in memory, anyway? It seems to me that if you need that, you are not successfully extracting the data into the other columns.
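If the goal is simply to keep the inner rows attached to each outer row, one workaround (a sketch using plain lists of records rather than nested dataframes, and the sqlctx from the question) is to store the inner data in an ArrayType(StructType(...)) column:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

inner_schema = StructType([StructField("W", StringType(), True)])
outer_schema = StructType([
    StructField("A", IntegerType(), True),
    StructField("B", IntegerType(), True),
    StructField("C", ArrayType(inner_schema), True),
])

# each inner "dataframe" is stored as a plain list of records instead
inner_records = [{"W": "X"}, {"W": "Y"}, {"W": "Z"}]
data = [(1, 4, inner_records), (2, 5, inner_records), (3, 6, inner_records)]

nested_df = sqlctx.createDataFrame(data, schema=outer_schema)
nested_df.show(truncate=False)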