How to extract a value from JSON when doing a PySpark query - apache-spark

This is what the table looks like,
which I extract using the following command:
query="""
select
distinct
userid,
region,
json_data
from mytable
where
operation = 'myvalue'
"""
table=spark.sql(query)
Now, I wish to extract only the value of msg_id from the json_data column (which is a string column), so that msg_id appears as its own column in the output.
How should I change the query in the code above to extract it from json_data?
Note:
The JSON format is not fixed (i.e., it may contain other fields), but the value I want to extract always has the key msg_id.
I want to do this during retrieval for efficiency reasons, though I could retrieve json_data and parse it afterwards.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("json", StringType(), True)
])
data = [("a", "b", '{"msg_id":"123","msg":"test"}'),
        ("c", "d", '{"msg_id":"456","column1":"test"}')]
df = spark.createDataFrame(data, schema)

# Infer the JSON schema by reading the string column as a JSON dataset,
# then parse the column with from_json and pull out msg_id.
json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
df2 = df.withColumn('parsed', from_json(col('json'), json_schema))
df2.createOrReplaceTempView("test")
spark.sql("select a, b, parsed.msg_id from test").show()
OUTPUT >>>
+---+---+------+
| a| b|msg_id|
+---+---+------+
| a| b| 123|
| c| d| 456|
+---+---+------+
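Worth noting (my addition, not part of the original answer): inferring json_schema with spark.read.json takes an extra pass over the json column. You can inspect what was inferred before reusing it; a minimal check, assuming the json_schema variable from the snippet above:
# With the sample rows above, msg_id is inferred as a string because the JSON values are quoted.
print(json_schema.simpleString())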

Instead of reading the data to infer the schema, you can specify the schema explicitly using StructType and StructField syntax, the DDL struct<...> string syntax, or schema_of_json, as shown below:
df.show() #sampledataframe
#+------+------+-----------------------------------------+
#|userid|region|json_data |
#+------+------+-----------------------------------------+
#|1 |US |{"msg_id":123} |
#|2 |US |{"msg_id":123} |
#|3 |US |{"msg_id":123} |
#|4 |US |{"msg_id":123,"is_ads":true,"location":2}|
#|5 |US |{"msg_id":456} |
#+------+------+-----------------------------------------+
from pyspark.sql import functions as F
from pyspark.sql.types import *

schema = StructType([StructField("msg_id", LongType(), True),
                     StructField("is_ads", BooleanType(), True),
                     StructField("location", LongType(), True)])
#OR
schema = 'struct<is_ads:boolean,location:bigint,msg_id:bigint>'
#OR
schema = df.select(F.schema_of_json("""{"msg_id":123,"is_ads":true,"location":2}""")).collect()[0][0]

df.withColumn("json_data", F.from_json("json_data", schema))\
  .select("userid", "region", "json_data.msg_id").show()
#+------+------+------+
#|userid|region|msg_id|
#+------+------+------+
#| 1| US| 123|
#| 2| US| 123|
#| 3| US| 123|
#| 4| US| 123|
#| 5| US| 456|
#+------+------+------+
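To tie this back to the original question of changing the query itself: reasonably recent Spark versions (2.3+) also expose from_json in SQL with a DDL schema string, so the extraction can happen directly inside spark.sql. A minimal sketch (my addition, assuming the table and column names from the question):
query = """
select
    distinct
    userid,
    region,
    from_json(json_data, 'msg_id bigint').msg_id as msg_id
from mytable
where
    operation = 'myvalue'
"""
table = spark.sql(query)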

Related

Count unique values for every row in PySpark

I have a PySpark DataFrame:
from pyspark.sql.types import *

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
    StructField("col3", StringType()),
    StructField("col4", StringType()),
])
data = [("aaa", "aab", "baa", "aba"),
        ("aab", "aab", "abc", "daa"),
        ("aa", "bb", "cc", "dd"),
        (1, "bbb", 2, 2)]
df = spark.createDataFrame(data=data, schema=schema)
I need to calculate the count of unique values in each row. I understand that it should be something like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf

@udf(ArrayType(df.schema))
def substract_unique(row):
    return len(set(row))

df = df.withColumn("test", substract_unique(row))
But I can't understand how to pass the whole row into the UDF. All the examples I've seen use either one or a few columns, or lambda functions for returning min, mean and max values.
It would be perfect if you could give an example or advice using pandas_udf or a UDF.
Don't go for a udf. It is slow when working with big data. Use native Spark functions as much as possible; if that's not possible, try to create a pandas_udf.
Native Spark approach:
from pyspark.sql import functions as F
# array_distinct drops duplicate values within each row's array; size counts what remains
df = df.withColumn("unique", F.size(F.array_distinct(F.array(df.columns))))
df.show()
# +----+----+----+----+------+
# |col1|col2|col3|col4|unique|
# +----+----+----+----+------+
# | aaa| aab| baa| aba| 4|
# | aab| aab| abc| daa| 3|
# | aa| bb| cc| dd| 4|
# | 1| bbb| 2| 2| 3|
# +----+----+----+----+------+
pandas_udf approach:
import pandas as pd
from pyspark.sql import functions as F

@F.pandas_udf('long')
def count_unique(d: pd.DataFrame) -> pd.Series:
    # the struct column arrives as a pandas DataFrame; nunique(axis=1) counts per row
    return d.nunique(axis=1)

df = df.withColumn("unique", count_unique(F.struct(*df.columns)))
df.show()
# +----+----+----+----+------+
# |col1|col2|col3|col4|unique|
# +----+----+----+----+------+
# | aaa| aab| baa| aba| 4|
# | aab| aab| abc| daa| 3|
# | aa| bb| cc| dd| 4|
# | 1| bbb| 2| 2| 3|
# +----+----+----+----+------+
It was simple...
@udf()
def substract_unique(*values):
    return len(set(values))

cols = df.columns
df = df.withColumn("unique", substract_unique(*cols))

Spark dataframe foreachPartition: sum the elements using pyspark

I am trying to partition a Spark dataframe and sum the elements in each partition using PySpark, but I am unable to do this inside the called function "sumByHour"; basically, I am unable to access the dataframe columns inside "sumByHour".
I am partitioning by the "hour" column and trying to sum the elements for each "hour" partition, so the expected output is 6, 15 and 24 for hours 0, 1 and 2 respectively. I tried the code below with no luck.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

def sumByHour(ip):
    print(ip)

pandasDF = pd.DataFrame({'hour': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                         'numlist': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
myschema = StructType(
    [StructField('hour', IntegerType(), False),
     StructField('numlist', IntegerType(), False)]
)
myDf = spark.createDataFrame(pandasDF, schema=myschema)
mydf = myDf.repartition(3, "hour")
myDf.foreachPartition(sumByHour)
I am able to solve this with Window.partitionBy, but I want to know if it can be solved with foreachPartition.
Thanks in Advance,
Sri
Thanks for the code sample, it made this easy. Here's a really simple example that modifies your sumByHour code:
def sumByHour(ip):
    mySum = 0
    myPartition = ""
    for x in ip:
        mySum += x.numlist
        myPartition = x.hour
    myString = '{}_{}'.format(mySum, myPartition)
    print(myString)

mydf = myDf.repartition(5, "hour")  # wait, 5? I wanted 3!!!
You get almost the expected result:
>>> mydf.foreachPartition(sumByHour)
0_
0_
24_2
6_0
15_1
>>>
You might ask why partition by 5 and not 3? It turns out the hash formula used with 3 partitions puts hours 0 and 1 into the same partition and leaves one partition empty (bad luck). So this works, but you only want to use it when each partition fits into memory.
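If you want to see that collision for yourself, one quick check (my addition, not part of the original answer) is to tag each row with the partition it lands in, using the built-in spark_partition_id function:
from pyspark.sql import functions as F

# Tag each row with its partition id after hashing on "hour", then summarize.
myDf.repartition(3, "hour") \
    .withColumn("partition", F.spark_partition_id()) \
    .groupBy("hour", "partition").count().show()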
You can use a Window to do that and add sumByHour as a new column.
from pyspark.sql import functions, Window
w = Window.partitionBy("hour")
myDf = myDf.withColumn("sumByHour", functions.sum("numlist").over(w))
myDf.show()
+----+-------+---------+
|hour|numlist|sumByHour|
+----+-------+---------+
| 1| 4| 15|
| 1| 5| 15|
| 1| 6| 15|
| 2| 7| 24|
| 2| 8| 24|
| 2| 9| 24|
| 0| 1| 6|
| 0| 2| 6|
| 0| 3| 6|
+----+-------+---------+

Transform list in a dataframe (same row, different columns) in Pyspark

I got one list from a dataframe's column:
list_recs = [row[0] for row in df_recs.select("name").collect()]
The list looks like this:
Out[243]: ['COL-4560', 'D65-2242', 'D18-4751', 'D68-3303']
I want to transform it into a new dataframe, with each value in a different column. I tried doing this:
from pyspark.sql import Row
rdd = sc.parallelize(list_recs)
recs = rdd.map(lambda x: Row(SKU=str(x[0]), REC_01=str(x[1]), REC_02=str(x[2]), REC_03=str(x[3])))#, REC_04=str(x[4]), REC_0=str(x[5])))
schemaRecs = sqlContext.createDataFrame(recs)
But the outcome I'm getting is:
+---+------+------+------+
|SKU|REC_01|REC_02|REC_03|
+---+------+------+------+
| C| O| L| -|
| D| 6| 5| -|
| D| 1| 8| -|
| D| 6| 8| -|
+---+------+------+------+
What I wanted:
+----------+-------------+-------------+-------------+
|SKU |REC_01 |REC_02 |REC_03 |
+----------+-------------+-------------+-------------+
| COL-4560| D65-2242| D18-4751| D68-3303|
+----------+-------------+-------------+-------------+
I've also tried spark.createDataFrame(lista_recs, StringType()) but got all the items in the same column.
Thank you in advance.
Define a schema and use spark.createDataFrame():
list_recs = ['COL-4560', 'D65-2242', 'D18-4751', 'D68-3303']

from pyspark.sql.functions import *
from pyspark.sql.types import *

schema = StructType([StructField("SKU", StringType(), True),
                     StructField("REC_01", StringType(), True),
                     StructField("REC_02", StringType(), True),
                     StructField("REC_03", StringType(), True)])

# Wrap the list in another list so it becomes a single row with four columns.
spark.createDataFrame([list_recs], schema).show()
#+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03|
#+--------+--------+--------+--------+
#|COL-4560|D65-2242|D18-4751|D68-3303|
#+--------+--------+--------+--------+
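If the number of recommendations is not fixed (the commented-out REC_04 and REC_05 in the question hint at this), the schema can also be built from the list length instead of being written out by hand; a small sketch, where the generated column names are my assumption:
from pyspark.sql.types import StructType, StructField, StringType

# One SKU column followed by one REC_xx column per remaining list element.
cols = ["SKU"] + ["REC_{:02d}".format(i) for i in range(1, len(list_recs))]
schema = StructType([StructField(c, StringType(), True) for c in cols])
spark.createDataFrame([list_recs], schema).show()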

Spark doesn't read columns with null values in first row

Below is the content of my csv file:
A1,B1,C1
A2,B2,C2,D1
A3,B3,C3,D2,E1
A4,B4,C4,D3
A5,B5,C5,,E2
So, there are 5 columns but only 3 values in the first row.
I read it using the following command:
val csvDF: DataFrame = spark.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .csv("file.csv")
And the following is what I get using csvDF.show():
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| A1| B1| C1|
| A2| B2| C2|
| A3| B3| C3|
| A4| B4| C4|
| A5| B5| C5|
+---+---+---+
How can I read all the data in all the columns?
Basically, your csv file isn't properly formatted, in the sense that it doesn't have an equal number of columns in each row, which is required if you want to read it with spark.read.csv. However, you can instead read it with spark.read.textFile and then parse each row.
As I understand it, you do not know the number of columns beforehand and want your code to handle an arbitrary number of them. To do this you need to establish the maximum number of columns in your data set, so you need two passes over the data.
For this particular problem, I would actually go with RDDs instead of DataFrames or Datasets, like this:
val data = spark.read.textFile("file.csv").rdd
val rdd = data.map(s => (s, s.split(",").length)).cache
val maxColumns = rdd.map(_._2).max()

val x = rdd
  .map(row => {
    val rowData = row._1.split(",")
    val extraColumns = Array.ofDim[String](maxColumns - rowData.length)
    Row((rowData ++ extraColumns).toList: _*)
  })
Hope that helps :)
You can read it as a dataset with only one column (for example, by using another delimiter):
var df = spark.read.format("csv").option("delimiter",";").load("test.csv")
df.show()
+--------------+
| _c0|
+--------------+
| A1,B1,C1|
| A2,B2,C2,D1|
|A3,B3,C3,D2,E1|
| A4,B4,C4,D3|
| A5,B5,C5,,E2|
+--------------+
Then you can manually split your column into five; this will add null values when the element does not exist:
var csvDF = df.withColumn("_tmp", split($"_c0", ",")).select(
  $"_tmp".getItem(0).as("col1"),
  $"_tmp".getItem(1).as("col2"),
  $"_tmp".getItem(2).as("col3"),
  $"_tmp".getItem(3).as("col4"),
  $"_tmp".getItem(4).as("col5")
)
csvDF.show()
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
| A1| B1| C1|null|null|
| A2| B2| C2| D1|null|
| A3| B3| C3| D2| E1|
| A4| B4| C4| D3|null|
| A5| B5| C5| | E2|
+----+----+----+----+----+
If the column data types and the number of columns are known, you can define a schema and apply it while reading the csv file as a dataframe. Below I have defined all five columns as StringType:
val schema = StructType(Seq(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true),
  StructField("col5", StringType, true)))

val csvDF: DataFrame = sqlContext.read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(schema)
  .csv("file.csv")
You should get a dataframe like:
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A1 |B1 |C1 |null|null|
|A2 |B2 |C2 |D1 |null|
|A3 |B3 |C3 |D2 |E1 |
|A4 |B4 |C4 |D3 |null|
|A5 |B5 |C5 |null|E2 |
+----+----+----+----+----+

Efficient Dataframe lookup in Apache Spark

I want to efficiently look up many IDs. What I have is a dataframe df_source that looks like this, but with a couple of million records distributed across 10 workers:
+-------+----------------+
| URI| Links_lists|
+-------+----------------+
| URI_1|[URI_8,URI_9,...|
| URI_2|[URI_6,URI_7,...|
| URI_3|[URI_4,URI_1,...|
| URI_4|[URI_1,URI_5,...|
| URI_5|[URI_3,URI_2,...|
+-------+----------------+
My first step would be to make an RDD out of df_source:
rdd_source = df_source.rdd
From rdd_source I want to create an RDD that contains only the URIs with IDs. I do it like this:
rdd_index = rdd_source.map(lambda x: x[0]).zipWithUniqueId()
Now I also .flatMap() rdd_source into an RDD that contains all relations, which until now were only contained within the Links_lists column:
# pair each URI with each of the links in its Links_lists array
rdd_relations = rdd_source.flatMap(lambda x: [(x[0], link) for link in x[1]])
Now I transform both rdd_index and rdd_relations back into dataframes, because I want to do joins and I think (I might be wrong on this) that joins on dataframes are faster.
schema_index = StructType([
    StructField("URI", StringType(), True),
    StructField("ID", IntegerType(), True)])
df_index = sqlContext.createDataFrame(rdd_index, schema=schema_index)
and
schema_relation = StructType([
    StructField("URI", StringType(), True),
    StructField("LINK", StringType(), True)])
df_relations = sqlContext.createDataFrame(rdd_relations, schema=schema_relation)
The resulting dataframes should look like these two :
df_index:
+-------+-------+
| URI| ID|
+-------+-------+
| URI_1| 1|
| URI_2| 2|
| URI_3| 3|
| URI_4| 4|
| URI_5| 5|
+-------+-------+
df_relations:
+-------+-------+
| URI| LINK|
+-------+-------+
| URI_1| URI_5|
| URI_1| URI_8|
| URI_1| URI_9|
| URI_2| URI_3|
| URI_2| URI_4|
+-------+-------+
Now, to replace the long URI strings in df_relations, I will do joins on df_index. The first join:
df_relations = \
    df_relations.join(df_index, df_relations.URI == df_index.URI, 'inner')\
                .select(col('ID').alias('URI_ID'), col('LINK'))
This should yield me a dataframe looking like this:
df_relations:
+-------+-------+
| URI_ID| LINK|
+-------+-------+
| 1| URI_5|
| 1| URI_8|
| 1| URI_9|
| 2| URI_3|
| 2| URI_4|
+-------+-------+
And the second join:
df_relations = \
    df_relations.join(df_index, df_relations.LINK == df_index.URI, 'inner')\
                .select(col('URI_ID'), col('ID').alias('LINK_ID'))
This should result in the final dataframe, the one I need, looking like this:
df_relations:
+-------+-------+
| URI_ID|LINK_ID|
+-------+-------+
| 1| 5|
| 1| 8|
| 1| 9|
| 2| 3|
| 2| 4|
+-------+-------+
where all URIs are replaced with IDs from df_index.
Is this an efficient way to look up the IDs for all URIs in both columns of the relation table, or is there a more effective way of doing this?
I'm using Apache Spark 2.1.0 with Python 3.5.
You do not need to use RDDs for the operations you described; using RDDs can be very costly. Second, you do not need to do two joins, you can do just one:
import pyspark.sql.functions as f

# add a unique id for each URI
withID = df_source.withColumn("URI_ID", f.monotonically_increasing_id())
# create a single line from each element in the array
exploded = withID.select("URI_ID", f.explode("Links_lists").alias("LINK"))
# lookup table mapping each URI to its id
linkID = withID.withColumnRenamed("URI_ID", "LINK_ID").drop("Links_lists")
# a single join replaces the remaining URI strings with their ids
joined = exploded.join(linkID, on=exploded.LINK == linkID.URI).drop("URI").drop("LINK")
Lastly, if linkID (which is basically df_source with a column replaced) is relatively small (i.e. it can be fully contained in a single worker), you can broadcast it. Add the following before the join:
linkID = f.broadcast(linkID)
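As a side note (my addition, not part of the original answer): Spark also broadcasts small tables automatically when their estimated size is below spark.sql.autoBroadcastJoinThreshold, so explicit broadcasting mainly matters when that estimate is off or the threshold is low. A quick way to check or adjust it:
# Current threshold in bytes (-1 disables automatic broadcast joins).
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
# Raise it to 50 MB, for example.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)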
