How to get a specific value within an array using SQL - Databricks

I have the following array under a column 'X'. I'm trying to get task_due_date where task_type is 'sizeScale' using Spark SQL.
[{"task_type": "sizeScale", "task_due_date": "2019-01-02"}, {"task_type": "colorBreakdown", "task_due_date": "2019-01-02"}, {"task_type": "priceTicket", "task_due_date": "2019-01-02"}, {"task_type": "dcSplit", "task_due_date": "2019-01-02"}, {"task_type": "cartonLabel", "task_due_date": "2019-01-02"}]
I tried exploding the array into a separate table and joining it back to the original. Is there a more direct option?

Hope this helps:
data = [[[{"task_type": "sizeScale", "task_due_date": "2019-01-02"}, {"task_type": "colorBreakdown", "task_due_date": "2019-01-02"}, {"task_type": "priceTicket", "task_due_date": "2019-01-02"}, {"task_type": "dcSplit", "task_due_date": "2019-01-02"}, {"task_type": "cartonLabel", "task_due_date": "2019-01-02"}]]]
df = spark.createDataFrame(data=data,schema=["col_1"])
df.createOrReplaceTempView("sample")
df = spark.sql("select columns_1.task_due_date from (select explode(col_1) as columns_1 from sample) where columns_1.task_type = 'sizeScale'")
df.show()
Output
+-------------+
|task_due_date|
+-------------+
| 2019-01-02|
+-------------+
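If you are on Spark 2.4 or later (as recent Databricks runtimes are), the filter higher-order function is a more direct option that avoids exploding altogether. A sketch against the same temp view as above; the column is inferred here as an array of maps, hence the bracket lookups:
df = spark.sql("""
    SELECT element_at(
             filter(col_1, x -> x['task_type'] = 'sizeScale'), 1
           )['task_due_date'] AS task_due_date
    FROM sample
""")
df.show()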

Related

Error while converting RDD to Dataframe [PySpark]

I'm trying to convert an RDD back to a Spark DataFrame using the code below
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType

schema = StructType([
    StructField("msn", StringType(), True),
    StructField("Input_Tensor", ArrayType(DoubleType()), True)
])
DF = spark.createDataFrame(rdd, schema=schema)
The dataset has only two columns:
msn, which contains only a string of characters.
Input_Tensor, a 2D array of floats.
But I keep getting this error and I'm not sure where it's coming from:
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/myproject/datasets/train.py", line 51, in EMA_detector
DF = spark.createDataFrame(rdd, schema=schema)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/sql/session.py", line 790, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2364, in _to_java_object_rdd
return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2599, in _jrdd
self._jrdd_deserializer, profiler)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2500, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2486, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/serializers.py", line 694, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: AttributeError: 'NoneType' object has no attribute 'items'
EDIT:
My RDD comes from this:
rdd = test_data.mapPartitions(lambda part: vectorizer.transform(part))
The dataset test_data is itself an RDD, but after the mapPartitions it becomes a PipelinedRDD, and that seems to be why it fails.
Assuming your rdd is defined by the following data:
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
Then you can use the toDF() method, which will infer the data types for you; in this case string and array<array<double>>:
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"])
DataFrame[msn: string, Input_Tensor: array<array<double>>]
The end result:
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"]).show(truncate=False)
+----+---------------------------------------+
|msn |Input_Tensor |
+----+---------------------------------------+
|row1|[[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]] |
|row2|[[7.7, 8.8, 9.9], [10.1, 11.11, 12.12]]|
+----+---------------------------------------+
However, you can still use the createDataFrame method if the tensor is declared as a nested double array in the schema, i.e. ArrayType(ArrayType(DoubleType())):
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
rdd = sc.parallelize(data)
schema = StructType([
    StructField("msn", StringType(), True),
    StructField("Input_Tensor", ArrayType(ArrayType(DoubleType())), True)
])
spark.createDataFrame(rdd, schema=schema).show(truncate=False)
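Regarding the EDIT in the question: a PipelinedRDD produced by mapPartitions is not a problem in itself; the same nested schema applies as long as each element is a plain (string, 2-D list of floats) tuple. A minimal sketch reusing the data and schema above (passthrough is a hypothetical stand-in for the real vectorizer transform):
# Hypothetical passthrough: each element stays a (msn, 2-D list of floats) tuple
def passthrough(part):
    for msn, tensor in part:
        yield (msn, [[float(v) for v in row] for row in tensor])

pipelined = sc.parallelize(data).mapPartitions(passthrough)  # this is a PipelinedRDD
spark.createDataFrame(pipelined, schema=schema).show(truncate=False)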

Reading csv files in PySpark

I am trying to read a CSV file and convert it into a DataFrame.
input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name, Jan, Feb, March
As you can see, the CSV file has no trailing commas when the later expense values are missing.
Code:
from pyspark.sql.types import *
input1= sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))
schema = StructType([StructField('Id', StringType(), True), StructField('Name', StringType(), True),
                     StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
                     StructField('Mar', StringType(), True)])
df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas, which should handle everything for you. From there you can convert the pandas DataFrame to Spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:
import pandas as pd
# Reading in txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt', sep=",")

# Converting to spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
Which produced the expected output.
The faster method would be to import directly through Spark:
# Importing csv file using pyspark
csv_import = sqlContext.read \
    .format('csv') \
    .options(sep=',', header='true', inferSchema='true') \
    .load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
          StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
          StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
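As a side note, if you pass the schema straight to the CSV reader, the short rows are padded with null for you (PERMISSIVE is the default parse mode), so the detour through data.rdd is not strictly needed. A brief sketch with the same schema:
# Rows with fewer values than the schema get null in the trailing columns
df3 = spark.read.csv("test2.txt", schema=schema)
df3.show()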
Here are a couple of options for you to consider. These use the wildcard character, so you can loop through all folders and sub-folders, look for files with names that match a specific pattern, and merge everything into a single DataFrame.
val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.head()
myDFCsv.count()
//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name
val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
  .withColumn("file_name", input_file_name())

myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
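Since the question itself is about PySpark, here is a rough Python equivalent of the same wildcard load (same assumed mount path):
from pyspark.sql.functions import input_file_name

myDFCsv = (spark.read.format("csv")
           .option("sep", ",")
           .option("inferSchema", "true")
           .option("header", "true")
           .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
           .withColumn("file_name", input_file_name()))  # keep the source file name

myDFCsv.show(truncate=False)
myDFCsv.count()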

Table or view not found with registerTempTable

So I run the following in the pyspark shell:
>>> data = spark.read.csv("annotations_000", header=False, mode="DROPMALFORMED", schema=schema)
>>> data.show(3)
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
| item_id| review_id| text| aspect|sentiment|comments| annotation_round|
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
|9999900031|9999900031/custom...|Just came back to...|breakfast| 3| null|ASE_OpeNER_round2|
|9999900031|9999900031/custom...|Just came back to...| staff| 3| null|ASE_OpeNER_round2|
|9999900031|9999900031/custom...|The hotel was loc...| noise| 2| null|ASE_OpeNER_round2|
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
>>> data.registerTempTable("temp")
>>> df = sqlContext.sql("select first(item_id), review_id, first(text), concat_ws(';', collect_list(aspect)) as aspect from temp group by review_id")
>>> df.show(3)
+---------------------+--------------------+--------------------+--------------------+
|first(item_id, false)| review_id| first(text, false)| aspect|
+---------------------+--------------------+--------------------+--------------------+
| 100012|100012/tripadviso...|We stayed here la...| staff;room|
| 100013|100013/tripadviso...|We stayed for two...| breakfast|
| 100031|100031/tripadviso...|We stayed two nig...|noise;breakfast;room|
+---------------------+--------------------+--------------------+--------------------+
and it works perfectly with the shell sqlContext variable.
When I write it as a script:
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
sc = SparkContext(appName="AspectDetector")
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
data.registerTempTable("temp")
df = sqlContext.sql("select first(item_id), review_id, first(text), concat_ws(';', collect_list(aspect)) as aspect from temp group by review_id")
and run it, I get the following:
pyspark.sql.utils.AnalysisException: u'Table or view not found: temp;
line 1 pos 99'
How is that possible? Am I doing something wrong in the instantiation of sqlContext?
First you will want to initialize spark with Hive support, for example:
spark = SparkSession.builder \
    .master("yarn") \
    .appName("AspectDetector") \
    .enableHiveSupport() \
    .getOrCreate()

sqlContext = SQLContext(spark)
But instead of using sqlContext.sql(), you will want to use spark.sql() to run your query.
I found this confusing as well, but I think it is because when you do data.registerTempTable("temp") you are actually registering the view in the Spark session's context rather than in sqlContext. If you want to query a Hive table, you can still use sqlContext.sql().
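Putting it together, a minimal sketch of the script with a single SparkSession and a temp view (the path and schema are placeholders carried over from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AspectDetector") \
    .enableHiveSupport() \
    .getOrCreate()

# schema is assumed to be defined as in the original shell session
data = spark.read.csv("annotations_000", header=False,
                      mode="DROPMALFORMED", schema=schema)
data.createOrReplaceTempView("temp")  # preferred over registerTempTable in Spark 2.x

df = spark.sql("select first(item_id), review_id, first(text), "
               "concat_ws(';', collect_list(aspect)) as aspect "
               "from temp group by review_id")
df.show(3)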

How to reference a DataFrame when in a UDF on another DataFrame?

How do you reference a PySpark DataFrame during the execution of a UDF on another DataFrame?
Here's a dummy example. I am creating two DataFrames, scores and lastnames, and each contains a column that is the same across the two. In the UDF applied to scores, I want to filter lastnames and return the string found in its last_name column.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local")
sqlCtx = SQLContext(sc)
# Generate Random Data
import itertools
import random
student_ids = ['student1', 'student2', 'student3']
subjects = ['Math', 'Biology', 'Chemistry', 'Physics']
random.seed(1)
data = []
for (student_id, subject) in itertools.product(student_ids, subjects):
    data.append((student_id, subject, random.randint(0, 100)))
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("subject", StringType(), nullable=False),
    StructField("score", IntegerType(), nullable=False)
])
# Create DataFrame
rdd = sc.parallelize(data)
scores = sqlCtx.createDataFrame(rdd, schema)
# create another dataframe
last_name = ["Granger", "Weasley", "Potter"]
data2 = []
for i in range(len(student_ids)):
    data2.append((student_ids[i], last_name[i]))

schema = StructType([
    StructField("student_id", StringType(), nullable=False),
    StructField("last_name", StringType(), nullable=False)
])
rdd = sc.parallelize(data2)
lastnames = sqlCtx.createDataFrame(rdd, schema)
scores.show()
lastnames.show()
from pyspark.sql.functions import udf
def getLastName(sid):
    tmp_df = lastnames.filter(lastnames.student_id == sid)
    return tmp_df.last_name
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)
And the following is the last part of the trace:
Py4JError: An error occurred while calling o114.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
You can't directly reference a DataFrame (or an RDD) from inside a UDF. The DataFrame object is a handle on your driver that Spark uses to represent the data and actions that will happen out on the cluster. The code inside your UDFs will run out on the cluster at a time of Spark's choosing. Spark does this by serializing that code, making copies of any variables included in the closure, and sending them out to each worker.
What you want to do instead is use the constructs Spark provides in its API to join/combine the two DataFrames. If one of the data sets is small, you can manually send out the data in a broadcast variable and then access it from your UDF. Otherwise, you can just create the two DataFrames like you did, then use the join operation to combine them. Something like this should work:
joined = scores.withColumnRenamed("student_id", "join_id")
joined = joined.join(lastnames, joined.join_id == lastnames.student_id) \
               .drop("join_id")
joined.show()
+---------+-----+----------+---------+
| subject|score|student_id|last_name|
+---------+-----+----------+---------+
| Math| 13| student1| Granger|
| Biology| 85| student1| Granger|
|Chemistry| 77| student1| Granger|
| Physics| 25| student1| Granger|
| Math| 50| student2| Weasley|
| Biology| 45| student2| Weasley|
|Chemistry| 65| student2| Weasley|
| Physics| 79| student2| Weasley|
| Math| 9| student3| Potter|
| Biology| 2| student3| Potter|
|Chemistry| 84| student3| Potter|
| Physics| 43| student3| Potter|
+---------+-----+----------+---------+
It's also worth noting that, under the hood, Spark DataFrames have an optimization where a DataFrame that is part of a join can be converted to a broadcast variable to avoid a shuffle if it is small enough. So if you use the join method listed above, you should get the best possible performance without sacrificing the ability to handle larger data sets.
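If you ever want to force that behaviour explicitly, Spark also exposes a broadcast hint; a short sketch with the two DataFrames from the question:
from pyspark.sql.functions import broadcast

# Hint that lastnames is small enough to ship to every executor, avoiding a shuffle
scores.join(broadcast(lastnames), on="student_id").show()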
Change the pairs to a dictionary for easy lookup of names:
data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]
Instead of creating an RDD and converting it to a DataFrame, create a broadcast variable:
# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)
Now access this in the UDF via the value attribute of the broadcast variable (lastnames):
from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]
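For completeness, a sketch of wiring this back into scores, reusing the UDF registration pattern from the question:
from pyspark.sql.types import StringType

getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)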

Create DataFrame from list of tuples using pyspark

I am working with data extracted from SFDC using the simple-salesforce package.
I am using Python 3 for scripting and Spark 1.5.2.
I created an RDD containing the following data:
[('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
[('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
[('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
...
This data is in an RDD called v_rdd.
My schema looks like this:
StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true)))
I am trying to create DataFrame out of this RDD:
sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema)
I print the schema and show my DataFrame:
sqlDataFrame.printSchema()
sqlDataFrame.show()
And get the following:
+--------------------+--------------------+--------------------+
| Id| PackSize| Name|
+--------------------+--------------------+--------------------+
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
I am expecting to see actual data, like this:
+------------------+------------------+--------------------+
| Id|PackSize| Name|
+------------------+------------------+--------------------+
|a0w1a0000003xB1A | 1.0| A |
|a0w1a0000003xAAI | 1.0| B |
|a0w1a00000xB3AAI | 30.0| C |
Can you please help me identify what I am doing wrong here?
My Python script is long and I am not sure it would be convenient for people to sift through it, so I posted only the parts I am having issues with.
Thanks a ton in advance!
Hey, could you provide a working example next time? That would be easier.
The way your RDD is structured is an awkward fit for creating a DataFrame. This is how you create a DF according to the Spark documentation:
>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
So, concerning your example, you can create your desired output in the following way:
# Your data at the moment
data = sc.parallelize([
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
])
# Convert to tuple
data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))
# Define schema
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("Packsize", StringType(), True),
    StructField("Name", StringType(), True)
])
# Create dataframe
DF = sqlContext.createDataFrame(data_converted, schema)
# Output
DF.show()
+----------------+--------+----+
| Id|Packsize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A| 1.0| A|
|a0w1a0000003xAAI| 1.0| B|
|a0w1a00000xB3AAI| 30.0| C|
+----------------+--------+----+
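If the key order could ever differ between records, a slightly more defensive sketch is to turn each record into a dict first and pick the fields by name (same schema as above):
# Each record is a list of (key, value) pairs, so dict() gives name-based access
data_converted = data.map(dict).map(lambda d: (d["Id"], d["PackSize"], d["Name"]))
DF = sqlContext.createDataFrame(data_converted, schema)
DF.show()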
Hope this helps
