Selecting columns not present in the dataframe - apache-spark

So, I am creating a dataframe from an XML file. It has some information on a dealer, and a dealer has multiple cars - each car is a sub-element of the cars element and is represented by a value element - each cars.value element has various car attributes. So I use an explode function to create one row per car for a dealer, like so:
exploded_dealer = df.select('dealer_id',explode('cars.value').alias('a_car'))
And now I want to get various attributes of cars.value
I do it like this:
car_details_df = exploded_dealer.select('dealer_id','a_car.attribute1','a_car.attribute2')
And that works fine. But sometimes the cars.value elements don't have all the attributes I specify in my query. For example, some cars.value elements might have only attribute1 - and then I get the following error when running the above code:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2'
given input columns: [dealer_id,attribute1];"
How do I ask Spark to execute the same query anyway, but just return None for attribute2 if it is not present?
UPDATE: I read my data as follows:
initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')
exploded_dealer = df.select('financial_data',explode('cars.value').alias('a_car'))

Since you already make specific assumptions about the schema, the best thing you can do is to define it explicitly with nullable optional fields and use it when importing the data.
Let's say you expect documents similar to:
<rows>
  <row>
    <id>1</id>
    <objects>
      <object>
        <attribute1>...</attribute1>
        ...
        <attributeN>...</attributeN>
      </object>
    </objects>
  </row>
</rows>
where attribute1, attribute2, ..., attributeN may not all be present in a given batch, but you can define a finite set of choices and corresponding types. For simplicity, let's say there are only two options:
{("attribute1", StringType), ("attribute2", LongType)}
You can define schema as:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("objects", StructType([
        StructField("object", StructType([
            StructField("attribute1", StringType(), True),
            StructField("attribute2", LongType(), True)
        ]), True)
    ]), True),
    StructField("id", LongType(), True)
])
and use it with the reader:
spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)
It will be valid for any subset of attributes ({∅, {attribute1}, {attribute2}, {attribute1, attribute2}}). At the same time, it is more efficient than depending on schema inference.
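Applied to the original dealer/cars layout, a minimal sketch (the element names and file location placeholder are taken from the question; the exact schema is an assumption about which attributes can occur):
from pyspark.sql.functions import explode
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               StringType, LongType)

# Hypothetical schema mirroring the dealer/cars structure; both car
# attributes are declared nullable so either one may be missing.
dealer_schema = StructType([
    StructField("dealer_id", LongType(), True),
    StructField("cars", StructType([
        StructField("value", ArrayType(StructType([
            StructField("attribute1", StringType(), True),
            StructField("attribute2", LongType(), True)
        ])), True)
    ]), True)
])

df = (sqlContext.read.format('com.databricks.spark.xml')
      .options(rowTag='dealer')
      .schema(dealer_schema)
      .load('<xml file location>'))

exploded_dealer = df.select('dealer_id', explode('cars.value').alias('a_car'))
# attribute2 now resolves even when absent from the XML; it simply comes back as null.
exploded_dealer.select('dealer_id', 'a_car.attribute1', 'a_car.attribute2').show()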

Related

Changing values of a JSON with RDD

How do you set a value in an RDD once you transform it?
I am modifying a JSON file with pyspark and I have this case:
categories: [ {alias, title}, {alias, title}, {alias, title} ]
I have made a transformation that creates a list of titles for each row:
[title, title, title].
But how do I set the result back to the key categories?
At the end I want to get:
categories: [title, title, title]
This is the transformation that I am doing:
restaurantRDD.map(lambda x: x.data).flatMap(lambda x: x).map(lambda x: [row.title for row in x.categories])
There are also multiple transformations from restaurantRDD, similar to this one, which modify other parts of the JSON. How can I apply them all at once and then write the result to a new JSON file?
Should I use something else instead of RDD?
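For illustration only, a minimal sketch of the rewrite being described, assuming each flattened element is a pyspark.sql.Row whose categories field should be replaced while the other fields are kept (the helper name is hypothetical):
def replace_categories(record):
    # Turn the Row into a dict, swap categories for the list of titles,
    # and keep every other field untouched.
    d = record.asDict()
    d['categories'] = [row.title for row in record.categories]
    return d

transformed = (restaurantRDD
               .map(lambda x: x.data)
               .flatMap(lambda x: x)
               .map(replace_categories))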

Databricks schema enforcement issues

As suggested in the article about schema enforcement, a declared schema helps detect issues early.
The two issues described below however are preventing me from creating a descriptive schema.
Comments on a table column are seen as a difference in the schema
# Get data
test_df = spark.createDataFrame([('100000146710',)], ['code'])
# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')
# Create table: ... BOOM
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
code STRING COMMENT 'Unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
This will fail with AnalysisException: The specified schema does not match the existing schema at /my_table_location. The only solution I found is to drop the column comments.
Not null struct field shows as nullable
from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
StructField("code", StringType(), False)
])
json_df = (spark.read
.schema(json_schema)
.json('/my_input.json')
)
json_df.printSchema()
will show
root
|-- code: string (nullable = true)
So despite the schema declaration stating that a field is not null, the field shows as nullable in the dataframe. Because of this, adding a NOT NULL constraint on the table column will trigger the AnalysisException error.
Any comments or suggestions are welcome.
With the execution of
test_df.write.format("delta").mode("append").save('/my_table_location')
you have already created a new Delta table with its schema defined by test_df. This new table delta.`/my_table_location` already has the schema code STRING.
If you would like to create a comment within the schema, perhaps first create the table as you would like it defined, e.g.
spark.sql("""
CREATE TABLE my_table (
  code STRING COMMENT 'unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
And then insert your data from your test_df into it, e.g.
test_df.createOrReplaceTempView("test_df_view")
spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a Case Class:
case class Prod(productId: String, date: String, qty: Int, many other attributes ..)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId,date) tuple. However we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId, date) combinations back to the full rows; I am working through exactly how to do that, but even so it takes several operations. A simpler (and possibly faster) approach would be useful if one exists.
I'd use dropDuplicates on Dataset:
val rdd = sc.parallelize(Seq(
Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values

pyspark dataframe column name

What are the limitations on PySpark dataframe column names? I have an issue with the following code:
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by transformation from a pandas dataframe. Is there any issue with the dot in the column name string?
Dots are used for nested fields inside a structure type. So if you had a column called "address" of type StructType, and inside that you had street1, street2, etc., you would access the individual fields like this:
df.select("address.street1", "address.street2", ..)
Because of that, if you want to use a dot in your field name, you need to quote the field whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()
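Applied to the aggregation from the question, a sketch (assuming the same df_context_spark dataframe) would be:
from pyspark.sql.functions import col, count

# Backticks tell Spark the dot is part of the column name, not a struct access.
df_context_spark.agg(count(col("`spatialElementLabel.value`"))).show()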

What is the right way to do a semi-join on two Spark RDDs (in PySpark)?

In my PySpark application, I have two RDDs:
items - This contains the item ID and item name for all valid items. Approx 100000 items.
attributeTable - This contains the fields user ID, item ID, and an attribute value for this combination, in that order. There is a certain attribute for each user-item combination in the system. This RDD has several hundreds of thousands of rows.
I would like to discard all rows in attributeTable RDD that don't correspond to a valid item ID (or name) in the items RDD. In other words, a semi-join by the item ID. For instance, if these were R data frames, I would have done semi_join(attributeTable, items, by="itemID")
I tried the following approach first, but found that this takes forever to return (on my local Spark installation running on a VM on my PC). Understandably so, because there are such a huge number of comparisons involved:
# Create a broadcast variable of all valid item IDs for use in the filter
validItemIDs = sc.broadcast(items.map(lambda (itemID, itemName): itemID).collect())
attributeTable = attributeTable.filter(lambda (userID, itemID, attributes): itemID in set(validItemIDs.value))
After a bit of fiddling around, I found that the following approach works pretty fast (a minute or so on my system).
# Create a broadcast variable for item ID to item name mapping (dictionary)
itemIdToNameMap = sc.broadcast(items.collectAsMap())
# From the attribute table, remove records that don't correspond to a valid item name.
# First go over all records in the table and add a dummy field indicating whether the item name is valid
# Then, filter out all rows with invalid names. Finally, remove the dummy field we added.
attributeTable = (attributeTable
.map(lambda (userID, itemID, attributes): (userID, itemID, attributes, itemIdToNameMap.value.get(itemID, 'Invalid')))
.filter(lambda (userID, itemID, attributes, itemName): itemName != 'Invalid')
.map(lambda (userID, itemID, attributes, itemName): (userID, itemID, attributes)))
Although this works well enough for my application, it feels more like a dirty workaround and I am pretty sure there must be a cleaner, more idiomatic (and possibly more efficient) way to do this in Spark. What would you suggest? I am new to both Python and Spark, so any RTFM advice will also be helpful if you can point me to the right resources.
My Spark version is 1.3.1.
Just do a regular join and then discard the "lookup" relation (in your case, the items RDD).
If these are your RDDs (example taken from another answer):
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
then you'd do:
(attributeTable.keyBy(lambda x: x[1])
    .join(items)
    .map(lambda (key, (attribute, item)): attribute))
And as a result, you only have tuples from attributeTable RDD which have a corresponding entry in the items RDD:
[(123456, 123, 'Attribute for A')]
Doing it via leftOuterJoin as suggested in another answer will also do the job, but is less efficient. Also, the other answer semi-joins items with attributeTable instead of attributeTable with items.
As others have pointed out, this is probably most easily accomplished by leveraging DataFrames. However, you might be able to accomplish your intended goal by using the leftOuterJoin and the filter functions. Something a bit hackish like the following might suffice:
items = sc.parallelize([(123, "Item A"), (456, "Item B")])
attributeTable = sc.parallelize([(123456, 123, "Attribute for A")])
sorted(items.leftOuterJoin(attributeTable.keyBy(lambda x: x[1]))
.filter(lambda x: x[1][1] is not None)
.map(lambda x: (x[0], x[1][0])).collect())
returns
[(123, 'Item A')]
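For completeness, a sketch of the DataFrame route mentioned above (the column names are illustrative, and this assumes a Spark version whose DataFrame API supports the left semi join type):
itemsDF = sqlContext.createDataFrame(items, ["itemID", "itemName"])
attributesDF = sqlContext.createDataFrame(
    attributeTable, ["userID", "itemID", "attributeValue"])

# Keep only the attribute rows whose itemID also appears in itemsDF.
semiJoined = attributesDF.join(
    itemsDF, attributesDF.itemID == itemsDF.itemID, "leftsemi")
semiJoined.show()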
