Question: How can I convert a JSON string to a DataFrame and select only the keys I want?
I just started using Spark last week and I'm still learning, so please bear with me.
I'm using Spark (2.4) Structured Streaming. The Spark app gets data (via socket) from a Twitter stream, and each message sent is a full tweet JSON string. Below is one of the DataFrames; each row is the full JSON tweet.
+--------------------+
| value|
+--------------------+
|{"created_at":"Tu...|
|{"created_at":"Tu...|
|{"created_at":"Tu...|
+--------------------+
As Venkata suggested, I did this, translated to Python (full code below):
schema = StructType().add('created_at', StringType(), False).add('id_str', StringType(), False)
df = lines.selectExpr('CAST(value AS STRING)').select(from_json('value', schema).alias('temp')).select('temp.*')
This is the return value
+------------------------------+-------------------+
|created_at |id_str |
+------------------------------+-------------------+
|Wed Feb 20 04:51:18 +0000 2019|1098082646511443968|
|Wed Feb 20 04:51:18 +0000 2019|1098082646285082630|
|Wed Feb 20 04:51:18 +0000 2019|1098082646444441600|
|Wed Feb 20 04:51:18 +0000 2019|1098082646557642752|
|Wed Feb 20 04:51:18 +0000 2019|1098082646494797824|
|Wed Feb 20 04:51:19 +0000 2019|1098082646817681408|
+------------------------------+-------------------+
As can be seen, only the two keys that I wanted were included in the DataFrame.
Hope this helps other newbies.
Full code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
sc = spark.sparkContext

# Read the raw tweet JSON strings from the socket source
lines = spark.readStream.format('socket').option('host', '127.0.0.1').option('port', 9999).load()

# Keep only the keys we care about
schema = StructType().add('created_at', StringType(), False).add('id_str', StringType(), False)
df = lines.selectExpr('CAST(value AS STRING)').select(from_json('value', schema).alias('temp')).select('temp.*')

query = df.writeStream.format('console').option('truncate', 'false').start()

# This part only keeps the query running for a while before stopping it when run as an app. Not needed if using Jupyter.
import time
time.sleep(10)
query.stop()
Here's a sample code snippet you can use to convert from JSON to a DataFrame.
val schema = new StructType().add("id", StringType).add("pin", StringType)
val dataFrame = data
  .selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", schema).alias("tmp"))
  .select("tmp.*")
I have the below datetime in string type. I want to convert it into UTC with an offset.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
print("Print statement-1")

schema = StructType([
    StructField("author", StringType(), False),
    StructField("dt", StringType(), False)
])
data = [
    ["author1", "2022-07-22T09:25:47.261Z"],
    ["author2", "2022-07-22T09:26:47.291Z"],
    ["author3", "2022-07-22T09:23:47.411Z"],
    ["author4", "2022-07-22T09:25:47.291Z"]
]
df = spark.createDataFrame(data, schema)
I want to convert the dt column to UTC with an offset.
For example, the first row value would be 2022-07-22T09:25:47.2610000 +00:00.
How do I do that in PySpark and Spark SQL?
I can easily do it using regexp_replace:
df = df.withColumn("UTC", regexp_replace('dt', 'Z', '000 +00:00'))
because Z is the same as +00:00. But I am not sure that regexp_replace is the correct way to do the conversion. Is there a method that does the conversion properly rather than regexp_replace?
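In case it helps, here is a minimal, untested sketch of a parse-then-format alternative. It assumes Spark 3.x datetime patterns and spark.sql.session.timeZone set to UTC so the rendered offset comes out as +00:00; the ts and df_utc names are just placeholders, and the number of 'S' letters controls the fraction padding.
from pyspark.sql.functions import to_timestamp, date_format

# Offsets are rendered in the session time zone; set it to UTC so they print as +00:00
spark.conf.set("spark.sql.session.timeZone", "UTC")

df_utc = (
    df
    # 'X' accepts the literal 'Z' as a zero offset while parsing
    .withColumn("ts", to_timestamp("dt", "yyyy-MM-dd'T'HH:mm:ss.SSSX"))
    # 'SSSSSS' pads the fraction to six digits; 'xxx' renders the offset as +00:00
    .withColumn("UTC", date_format("ts", "yyyy-MM-dd'T'HH:mm:ss.SSSSSS xxx"))
)
df_utc.show(truncate=False)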
I am relatively new to PySpark, and for orchestration I use Databricks.
[Just FYI: my source Parquet holds an SCD Type 4 dataset where the current snapshot and its history are maintained in a single row; the current snapshot is in individual Parquet columns, while the history snapshot sits in one column as a JSON array.]
I believe my solution could be the one used in the link below, and I just want to extend that solution to work for me (I am not able to comment on that post, and although my problem looks similar, it is slightly different):
https://stackoverflow.com/questions/56409454/casting-a-column-to-json-dict-and-flattening-json-values-in-a-column-in-pyspark/56409889#56409889
Reference courtesy: @Gingerbread, @Kafels
I tried to use the resolution from that post, but I'm getting errors.
Here's what my dataframe looks like:
|HISTORY|
|:------|
|[{"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"FSDF CC 10 ml ","DATE":"2021-06-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB CC ","DATE":"2021-03-11"}, {"HASH_KEY":"LulKYlm1qJaXFRq7oS1X1A==","SOURCE_KEY":"AAAAA","ATTR1":"BBB DD ","DATE":"2021-02-27"}]|
|[{"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAMES 50 ml ","DATE":"2021-03-02"}, {"HASH_KEY":"BK08ZMe/1UTHsenUAOMUwQ==","SOURCE_KEY":"BBBBB","ATTR1":"JAS 50 ml ","DATE":"2021-02-02"}]|
|null|
The DataFrame Schema is
root
|-- HISTORY: array (nullable = true)
| |-- element: string (containsNull = true)
The desired output is just the flattened JSON values as columns in PySpark:
|HASH_KEY |SOURCE_KEY|DATE |ATTR1 |
|:-----------------------|:--------:|:--------:|---------------:|
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-06-11|FSDF CC 10 ml |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-03-11|BBB CC |
|LulKYlm1qJaXFRq7oS1X1A==|AAAAA |2021-02-27|BBB DD |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-03-02|JAMES 50 ml |
|BK08ZMe/1UTHsenUAOMUwQ==|BBBBB |2021-02-02|JAS 50 ml |
|CAsaZMe/1UTHsenUasasaW==|BBBBB |2021-09-11|null |
The code snippet I tried:
import re
import json
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, StringType, TimestampType

schema = ArrayType(
    StructType(
        [
            StructField("HASH_KEY1", StringType()),
            StructField("SOURCE_KEY1", StringType()),
            StructField("ATTR1X", StringType()),
            StructField("DATE1", TimestampType())
        ]
    )
)

@f.udf(returnType=schema)
def parse_col(column):
    updated_values = []
    for it in re.finditer(r'[.*?]', column):
        parse = json.loads(it.group())
        for key, values in parse.items():
            for value in values:
                value['HASH_KEY1'] = key
                updated_values.append(value)
    return updated_values

df = df \
    .withColumn('tmp', parse_col(f.col('HISTORY'))) \
    .withColumn('tmp', f.explode(f.col('tmp'))) \
    .select(f.col('HASH_KEY'),
            f.col('tmp').HASH_KEY1.alias('HASH_KEY1'),
            f.col('tmp').SOURCE_KEY1.alias('SOURCE_KEY1'),
            f.col('tmp').ATTR1X.alias('ATTR1X'),
            f.col('tmp').DATE1.alias('DATE1'))
df.show()
The following is the result I got:
|HASH_KEY1|SOURCE_KEY1|ATTR1X|DATE1|
|:-------:|:---------:|:----:|----:|
|         |           |      |     |
|         |           |      |     |
I am having trouble getting the expected output.
Any help would be greatly appreciated. I am using Spark 2.0+.
Thank you!
I understood the usage of json_tuple and simplified my approach: I can directly explode the array into strings and then use the json_tuple function to convert them into flattened columns.
So the answer snippet now looks as follows:
from pyspark.sql import functions as f
from pyspark.sql.functions import json_tuple

DF_EXPLODE = df \
    .withColumn('Expand', f.explode(f.col('HISTORY'))) \
    .select(f.col('Expand'))

DF_FLATTEN = DF_EXPLODE \
    .select("*", json_tuple("Expand", "HASH_KEY").alias("HASH_KEY")) \
    .select("*", json_tuple("Expand", "SOURCE_KEY").alias("SOURCE_KEY")) \
    .select("*", json_tuple("Expand", "DATE").alias("DATE")) \
    .select("*", json_tuple("Expand", "ATTR1").alias("ATTR1"))
I also worked on my initial PySpark looping approach; the following is the code.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Find the longest HISTORY array so we know how many columns to expand into
DF_DIM2 = DF_DIM.withColumn("sizer", size(col('HISTORY'))).sort("sizer", ascending=False)
max_len = DF_DIM2.select('sizer').take(1)[0][0]
print(max_len)

# One column per array element: HISTORY_0, HISTORY_1, ...
expanded_df = DF_DIM.select(['*'] + [col('HISTORY')[i].alias(f'HISTORY_{i}') for i in range(max_len)])
original_cols = [i for i in expanded_df.columns if 'HISTORY_' not in i]
cols_exp = [i for i in expanded_df.columns if 'HISTORY_' in i]

schema = StructType([
    StructField("HASH_KEY", StringType(), True),
    StructField("SOURCE_KEY", StringType(), True),
    StructField("DATE", StringType(), True),
    StructField("ATTR1", StringType(), True)
])

# Parse each HISTORY_i JSON string into a struct column
final_df = expanded_df.select([from_json(i, schema).alias(i) for i in cols_exp])
I did some use-case testing: joining a 3.72-billion-row fact Parquet with the 390k-row Type 4 nested dimension Parquet took 2.5 minutes with this looping approach, while the explode option took over 4 minutes.
The explode option expands each Type 4 record by the number of changes recorded for that dimension in the HISTORY column. So, on average, if every dimension row changed 10 times, then 390k * 10 = 3.9M records are held in memory to join with the fact table, leading to longer processing times.
I have a case class:
case class clickStream(userid:String, adId :String, timestamp:String)
an instance of which I wish to send with KafkaProducer as:
val record = new ProducerRecord[String,clickStream](
"clicktream",
"data",
clickStream(Random.shuffle(userIdList).head, Random.shuffle(adList).head, new Date().toString).toString
)
producer.send(record)
which sends the record as a string to the topic, as expected:
clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)
However, the problem is at the consumer end:
val clickStreamDF = spark.readStream
.format("kafka")
.options(kafkaMap)
.option("subscribe","clicktream")
.load()
clickStreamDF
.select($"value".as("string"))
.as[clickStream] //trying to leverage DataSet APIs conversion
.writeStream
.outputMode(OutputMode.Append())
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
Apparently, using the .as[clickStream] API does not work, as the exception is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`userid`' given input columns: [value];
This is what the [value] column contains:
Batch: 2
-------------------------------------------
+----------------------------------------------------+
|value |
+----------------------------------------------------+
|clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020)|
+----------------------------------------------------+
I tried using a custom serializer as value.serializer and value.deserializer,
but I'm facing a different issue: a ClassNotFoundException related to my directory structure.
I have 3 questions:
How does Kafka use the custom deserializer class here to parse the object?
I do not fully understand the concept of Encoders and how they can be used in this case.
What would be the best approach to send/receive custom case class objects with Kafka?
Since you are passing the clickStream object data as a string to Kafka, Spark will read that same string. In Spark, you have to parse and extract the required fields from clickStream(user3,ad11,Sat Jul 18 20:59:35 IST 2020).
Check the code below.
clickStreamDF
  .select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
  .select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
  .as[clickStream] // extract all fields from the value string & then use .as[clickStream]; this line may not even be required, as the data is already parsed into the required shape
  .writeStream
  .outputMode(OutputMode.Append())
  .format("console")
  .option("truncate","false")
  .start()
  .awaitTermination()
A sample of how to parse the clickStream string data:
scala> df.show(false)
+---------------------------------------------------+
|value |
+---------------------------------------------------+
|clickStream(user5,ad2,Sat Jul 18 20:48:53 IST 2020)|
+---------------------------------------------------+
scala> df
.select(split(regexp_extract($"value","\\(([^)]+)\\)",1),"\\,").as("value"))
.select($"value"(0).as("userid"),$"value"(1).as("adId"),$"value"(2).as("timestamp"))
.as[clickStream]
.show(false)
+------+----+----------------------------+
|userid|adId|timestamp |
+------+----+----------------------------+
|user5 |ad2 |Sat Jul 18 20:48:53 IST 2020|
+------+----+----------------------------+
What would be the best approach to send/receive custom case class objects with Kafka?
Try converting your case class to JSON, Avro, or CSV, then send the message to Kafka and read the same message back using Spark.
How can I set a schema for a streaming DataFrame in PySpark?
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
# Import data types
from pyspark.sql.types import *
spark = SparkSession\
.builder\
.appName("StructuredNetworkWordCount")\
.getOrCreate()
# Create DataFrame representing the stream of input lines from the connection to 192.168.0.113:5560
lines = spark\
.readStream\
.format('socket')\
.option('host', '192.168.0.113')\
.option('port', 5560)\
.load()
For example, I need a table like:
Name, lastName, PhoneNumber
Bob, Dylan, 123456
Jack, Ma, 789456
....
How can I set the header/schema to ['Name', 'lastName', 'PhoneNumber'] with their data types?
Also, is it possible to display this table continuously, or say the top 20 rows of the DataFrame? When I tried it I got the error:
"pyspark.sql.utils.AnalysisException: 'Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;\nProject"
TextSocketSource doesn't provide any integrated parsing options. It is only possible to use one of the two formats:
timestamp and text if includeTimestamp is set to true with the following schema:
StructType([
StructField("value", StringType()),
StructField("timestamp", TimestampType())
])
text only if includeTimestamp is set to false with the schema as shown below:
StructType([StructField("value", StringType())]))
If you want to change this format you'll have to transform the stream to extract fields of interest, for example with regular expressions:
from pyspark.sql.functions import regexp_extract
from functools import partial
fields = partial(
regexp_extract, str="value", pattern="^(\w*)\s*,\s*(\w*)\s*,\s*([0-9]*)$"
)
lines.select(
fields(idx=1).alias("name"),
fields(idx=2).alias("last_name"),
fields(idx=3).alias("phone_number")
)
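For completeness, a minimal sketch of how the parsed stream could then be written out; append mode avoids the Complete-mode error quoted in the question (the parsed and query names are just placeholders):
parsed = lines.select(
    fields(idx=1).alias("name"),
    fields(idx=2).alias("last_name"),
    fields(idx=3).alias("phone_number")
)

# A plain projection with no aggregation supports append output mode
query = parsed.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()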
I am using PySpark through Spark 1.5.0.
I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp?
Something that would eventually come along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace
_ with - in the date half
and _ with : in the time part.
I was thinking I could split the column in two using substring and count backward from the end of the time, then do the regexp_replace separately and concatenate. But this seems like too many operations. Is there an easier way?
Spark >= 2.2
from pyspark.sql.functions import to_timestamp
(sc
.parallelize([Row(dt='2016_08_21 11_31_08')])
.toDF()
.withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
.show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
.parallelize([Row(dt='2016_08_21 11_31_08')])
.toDF()
.withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
# For Spark <= 1.5
# See issues.apache.org/jira/browse/SPARK-11724
.cast("double")
.cast("timestamp"))
.show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
It has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2, as zero323 stated). I tested it on Spark 2.3.0, using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
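For reference, a quick sketch of that check (it assumes an active spark session; the column names are just for illustration):
from pyspark.sql.functions import col

# Cast a millisecond-precision string straight to timestamp
df = spark.createDataFrame([("2016-07-13 14:33:53.979",)], ["datetime_str"])
df.withColumn("datetime", col("datetime_str").cast("timestamp")).show(truncate=False)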
I add some more code lines from Florent F's answer, for better understanding and for running the snippet on a local machine:
import os, pdb, sys
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()
df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
I just want to add more resources and an example to this discussion.
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
For example, if your ts string is "22 Dec 2022 19:06:36 EST", then the format is "dd MMM yyyy HH:mm:ss zzz"
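A minimal, untested sketch of that pattern in PySpark (the ts column name is just for illustration, and parsing zone names like EST with 'zzz' is assumed to follow the Spark 3.x datetime-pattern page linked above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# One sample row with the timestamp string from the example above
df = spark.createDataFrame([("22 Dec 2022 19:06:36 EST",)], ["ts"])
df.withColumn("parsed", to_timestamp("ts", "dd MMM yyyy HH:mm:ss zzz")).show(truncate=False)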