I'm trying to convert my dataframe into JSON so that it can be pushed into Elasticsearch. Here's what my dataframe looks like:
Provider  Market  Avg.  Deviation
XM        NY      10    5
TL        AT      8     6
LM        CA      7     8
I want to have it like this:
Column
XM: {
NY: {
Avg: 10,
Deviation: 5
}
}
How can I create something like this?
Check the code below; you can modify it as per your requirement.
scala> :paste
// Entering paste mode (ctrl-D to finish)
df
  .select(
    to_json(
      struct(
        map(
          $"provider",
          map(
            $"market",
            struct($"avg", $"deviation")
          )
        ).as("json_data")
      )
    ).as("data")
  )
  .select(get_json_object($"data", "$.json_data").as("data"))
  .show(false)
Output
+--------------------------------------+
|data |
+--------------------------------------+
|{"XM":{"NY":{"avg":10,"deviation":5}}}|
|{"TL":{"AT":{"avg":8,"deviation":6}}} |
|{"LM":{"CA":{"avg":7,"deviation":8}}} |
+--------------------------------------+
In case anyone wants it done the PySpark way (Spark 2.0+):
from pyspark.sql import Row
from pyspark.sql.functions import get_json_object, to_json, struct, create_map
row = Row('Provider', 'Market', 'Avg', 'Deviation')
row_df = spark.createDataFrame(
    [row('XM', 'NY', '10', '5'),
     row('TL', 'AT', '8', '6'),
     row('LM', 'CA', '7', '8')])
row_df.show()

row_df.select(
    to_json(struct(
        create_map(
            row_df.Provider,
            create_map(row_df.Market,
                       struct(row_df.Avg, row_df.Deviation))
        )
    )).alias("json")
).select(get_json_object('json', '$.col1').alias('json')).show(truncate=False)
Output:
+--------+------+---+---------+
|Provider|Market|Avg|Deviation|
+--------+------+---+---------+
| XM| NY| 10| 5|
| TL| AT| 8| 6|
| LM| CA| 7| 8|
+--------+------+---+---------+
+------------------------------------------+
|json |
+------------------------------------------+
|{"XM":{"NY":{"Avg":"10","Deviation":"5"}}}|
|{"TL":{"AT":{"Avg":"8","Deviation":"6"}}} |
|{"LM":{"CA":{"Avg":"7","Deviation":"8"}}} |
+------------------------------------------+
I have a dataset which consists of two columns, C1 and C2. The columns are associated in a many-to-many relation.
What I would like to do is find for each C2 the value C1 which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], [line.split(",")[1]])) \
    .reduceByKey(lambda x, y: x + y)
What this does is gather, for each C1 value, all the C2 matches; the count of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key, Value_list[value1, value2, ...]) --> (value1, key), (value2, key), ...
How could this be done using Spark? Any advice would be really helpful.
Thanks in advance!
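For reference, the inversion described above (flipping each (key, value_list) pair into (value, key) pairs) maps naturally onto flatMap. The following is only a rough RDD sketch under the question's comma-separated input assumption (the placeholder path is kept as-is), not a tested solution:
# Rough sketch: build (c1, [c2, ...]) per key, then flip each list element into
# (c2, (c1, match_count)) and keep, per c2, the c1 with the most matches.
pairs = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], [line.split(",")[1]])) \
    .reduceByKey(lambda x, y: x + y) \
    .flatMap(lambda kv: [(v, (kv[0], len(kv[1]))) for v in kv[1]]) \
    .reduceByKey(lambda a, b: a if a[1] >= b[1] else b)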
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
         .count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(
             F.max(
                 F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
             ).alias('c')
         )
         .select('Out1', 'c.Out2', 'c.matches')
         .orderBy('Out1'))
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
We can get the desired result easily using the DataFrame API.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col("c2").alias("out1")) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+
I have tabular data with keys and values, and the keys are not unique.
For example:
+-----+------+
| key | value|
+-----+------+
| 1 | the |
| 2 | i |
| 1 | me |
| 1 | me |
| 2 | book |
| 1 |table |
+-----+------+
Now assume this table is distributed across the different nodes in spark cluster.
How do I use PySpark to calculate the frequencies of the words with respect to the different keys? For instance, in the above example I wish to output:
+-----+------+-------------+
| key | value| frequencies |
+-----+------+-------------+
| 1 | the | 1/4 |
| 2 | i | 1/2 |
| 1 | me | 2/4 |
| 2 | book | 1/2 |
| 1 |table | 1/4 |
+-----+------+-------------+
I'm not sure whether you can combine multi-level operations with DataFrames, but doing it in 2 steps and leaving the final concat to you, this works:
# Running in Databricks; not all of these imports are required.
# You may want to convert to upper- or lowercase for better results.
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
data = [("1", "the"), ("2", "I"), ("1", "me"),
("1", "me"), ("2", "book"), ("1", "table")]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1]))
df = sqlContext.createDataFrame(someschema)
df1 = df.groupBy("c1", "c2") \
.count()
df2 = df1.groupBy('c1') \
.sum('count')
df3 = df1.join(df2,'c1')
df3.show()
returns:
+---+-----+-----+----------+
| c1| c2|count|sum(count)|
+---+-----+-----+----------+
| 1|table| 1| 4|
| 1| the| 1| 4|
| 1| me| 2| 4|
| 2| I| 1| 2|
| 2| book| 1| 2|
+---+-----+-----+----------+
You can reformat the last 2 columns (see the sketch below), but I am curious whether we can do it all in one go; in normal SQL we would use inline views and combine them, I suspect.
This works across the cluster as standard, which is what Spark is generally all about; the groupBy takes that into account.
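For what it's worth, a minimal sketch of that last formatting step, assuming the df3 built above:
import pyspark.sql.functions as F

# Turn the two count columns of df3 into a "count/total" string per (c1, c2).
freq_df = (df3
           .withColumnRenamed("sum(count)", "total")
           .withColumn("frequencies",
                       F.concat_ws("/", F.col("count").cast("string"),
                                   F.col("total").cast("string")))
           .select(F.col("c1").alias("key"), F.col("c2").alias("value"), "frequencies"))
freq_df.show()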
Minor edit
As it is rather hot outside, I looked into this in a little more depth. This is a good overview: http://stevendavistechnotes.blogspot.com/2018/06/apache-spark-bi-level-aggregation.html. After reading this and experimenting, I could not get it any more elegant; reducing to 5 rows of output all in one go appears not to be possible.
Another viable option is window functions.
First, compute the number of occurrences per (key, value) pair and per key. Then add another column with the Fraction (you will get reduced fractions).
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from fractions import Fraction
from pyspark.sql.functions import udf

@udf(StringType())
def getFraction(value_occurrence, key_occurrence):
    # e.g. 2/4 is reduced to 1/2
    return str(Fraction(value_occurrence, key_occurrence))

schema = StructType([StructField("key", IntegerType(), True),
                     StructField("value", StringType(), True)])

data = [(1, "the"), (2, "I"), (1, "me"),
        (1, "me"), (2, "book"), (1, "table")]

spark = SparkSession.builder.appName('myPython').getOrCreate()
input_df = spark.createDataFrame(data, schema)

(input_df
 .withColumn("key_occurrence",
             F.count(F.lit(1)).over(Window.partitionBy(F.col("key"))))
 .withColumn("value_occurrence",
             F.count(F.lit(1)).over(Window.partitionBy(F.col("value"), F.col("key"))))
 .withColumn("frequency",
             getFraction(F.col("value_occurrence"), F.col("key_occurrence")))
 .dropDuplicates()
 .show())
How do I lowercase the column names of a dataframe, but not its values, using raw Spark SQL and DataFrame methods?
Input dataframe (imagine I have hundreds of these columns in uppercase):
NAME | COUNTRY | SRC | CITY | DEBIT
---------------------------------------------
"foo"| "NZ" | salary | "Auckland" | 15.0
"bar"| "Aus" | investment | "Melbourne"| 12.5
Target dataframe:
name | country | src | city | debit
------------------------------------------------
"foo"| "NZ" | salary | "Auckland" | 15.0
"bar"| "Aus" | investment | "Melbourne"| 12.5
If you are using Scala, you can simply do the following:
import org.apache.spark.sql.functions._
df.select(df.columns.map(x => col(x).as(x.toLowerCase)): _*).show(false)
And if you are using PySpark, you can simply do the following:
from pyspark.sql import functions as F
df.select([F.col(x).alias(x.lower()) for x in df.columns]).show()
Java 8 solution to convert the column names to lower case:
import static org.apache.spark.sql.functions.col;
import java.util.Arrays;
import org.apache.spark.sql.Column;

df.select(Arrays.asList(df.columns()).stream()
        .map(x -> col(x).as(x.toLowerCase()))
        .toArray(size -> new Column[size]))
  .show(false);
How about this:
Some fake data:
scala> val df = spark.sql("select 'A' as AA, 'B' as BB")
df: org.apache.spark.sql.DataFrame = [AA: string, BB: string]
scala> df.show()
+---+---+
| AA| BB|
+---+---+
| A| B|
+---+---+
Now re-select all columns with a new name, which is just their lower-case version:
scala> val cols = df.columns.map(c => s"$c as ${c.toLowerCase}")
cols: Array[String] = Array(AA as aa, BB as bb)
scala> val lowerDf = df.selectExpr(cols:_*)
lowerDf: org.apache.spark.sql.DataFrame = [aa: string, bb: string]
scala> lowerDf.show()
+---+---+
| aa| bb|
+---+---+
| A| B|
+---+---+
Note: I use Scala. If you use PySpark and are not familiar with the Scala syntax, then df.columns.map(c => s"$c as ${c.toLowerCase}") becomes ["{} as {}".format(c, c.lower()) for c in df.columns] in Python, and cols:_* becomes *cols. Please note I didn't run this translation; a sketch follows below.
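An untested PySpark sketch of the same selectExpr approach might look like this:
# Re-select every column under its lowercase name (sketch only, not run).
cols = ["{} as {}".format(c, c.lower()) for c in df.columns]
lower_df = df.selectExpr(*cols)
lower_df.show()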
For Java 8:
Dataset<Row> input;
for (StructField field : input.schema().fields()) {
    String newName = field.name().toLowerCase(Locale.ROOT);
    input = input.withColumnRenamed(field.name(), newName);
    if (field.dataType() instanceof StructType) {
        StructType newStructType = (StructType) StructType.fromJson(
                field.dataType().json().toLowerCase(Locale.ROOT));
        input = input.withColumn(newName, col(newName).cast(newStructType));
    }
}
You can use df.withColumnRenamed(col_name, col_name.lower()) for a Spark dataframe in Python.
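For example, a minimal sketch applying it to every column:
# Rename each column to its lowercase form, one column at a time.
for col_name in df.columns:
    df = df.withColumnRenamed(col_name, col_name.lower())
df.show()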
I got this dataframe from a Kafka source.
+-----------------------+
| data |
+-----------------------+
| '{ "a": 1, "b": 2 }' |
+-----------------------+
| '{ "b": 3, "d": 4 }' |
+-----------------------+
| '{ "a": 2, "c": 4 }' |
+-----------------------+
I want to transform this into the following data frame:
+---------------------------+
| a | b | c | d |
+---------------------------+
| 1 | 2 | null | null |
+---------------------------+
| null | 3 | null | 4 |
+---------------------------+
| 2 | null | 4 | null |
+---------------------------+
The number of JSON fields may change, so I can't specify a schema for it.
I pretty much get the idea of how to do the transformation in Spark batch mode: use some map and reduce to get the set of JSON keys, then construct a new dataframe using withColumn. However, as far as I've explored, there is no map/reduce function in Structured Streaming. How do I achieve this?
UPDATE
I figured out that a UDF can be used to parse the string into JSON fields:
import simplejson as json
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

def convert_json(s):
    return json.loads(s)

udf_convert_json = udf(convert_json, StructType(<..some schema here..>))
df = df.withColumn('parsed_data', udf_convert_json(df.data))
However, since the schema is dynamic, I need to get all the JSON keys and values that exist in df.data for a certain window period in order to construct the StructType used as the UDF return type.
In the end, I guess I need to know how to perform a reduce over the dataset for a certain window period and then use the result as a lookup schema in the stream transformation.
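For the batch case at least, one way to avoid hand-writing the schema is to let Spark infer it from the JSON strings themselves. This is only a sketch of that idea (it won't work as-is on a streaming DataFrame, since .rdd and schema inference aren't available there), assuming the column is named data as in the example:
from pyspark.sql import functions as F

# Infer the schema from the JSON strings, then apply it with from_json (batch only).
inferred_schema = spark.read.json(df.rdd.map(lambda row: row.data)).schema
parsed = (df
          .select(F.from_json(F.col("data"), inferred_schema).alias("parsed"))
          .select("parsed.*"))
parsed.show()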
If you already know all the unique keys in your JSON data, then you can use the json_tuple function:
>>> df.show()
+------------------+
| data|
+------------------+
|{ "a": 1, "b": 2 }|
|{ "b": 3, "d": 4 }|
|{ "a": 2, "c": 4 }|
+------------------+
>>> from pyspark.sql import functions as F
>>> df.select(F.json_tuple(df.data,'a','b','c','d')).show()
+----+----+----+----+
| c0| c1| c2| c3|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField("a", StringType()),StructField("b", StringType()),StructField("c",StringType()),StructField("d", StringType())])
>>> df.select(F.from_json(df.data,schema).alias('data')).select(F.col('data.*')).show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
When you have a dynamic JSON column inside your PySpark DataFrame, you can use the code below to explode its fields into columns.
from pyspark.sql.functions import col, from_json

# udf_transform_tojsonstring is assumed to turn columnx into a valid JSON string
df2 = df.withColumn('columnx', udf_transform_tojsonstring(df.columnx))
columnx_jsonDF = spark.read.json(df2.rdd.map(lambda row: row.columnx)).drop('_corrupt_record')
df3 = df2.withColumn('columnx', from_json(col('columnx'), columnx_jsonDF.schema))
for c in set(columnx_jsonDF.columns):
    df3 = df3.withColumn(f'columnx_{c}', df3[f'columnx.`{c}`'])
Explanation:
First we use a UDF to transform our column into a valid JSON string (if that isn't already the case).
Then we read that column as a JSON DataFrame (with an inferred schema).
Then we read columnx again with the from_json() function, passing the columnx_jsonDF schema to it.
Finally we add a column to the main DataFrame for each key inside our JSON column.
This works if we don't know the JSON fields in advance and yet need to explode its columns.
I guess you don't need to do much. For example:
Bala:~:$ cat myjson.json
{ "a": 1, "b": 2 }
{ "b": 3, "d": 4 }
{ "a": 2, "c": 4 }
>>> df = sqlContext.sql("select * from json.`/Users/Bala/myjson.json`")
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
Can anyone please explain why case Row and Seq[Row] are used after the explode of a dataframe field which has a collection of elements?
And can you also please explain why asInstanceOf is required to get the values from the exploded field?
Here is the syntax:
val explodedDepartmentWithEmployeesDF =
  departmentWithEmployeesDF.explode(departmentWithEmployeesDF("employees")) {
    case Row(employee: Seq[Row]) =>
      employee.map(employee =>
        Employee(employee(0).asInstanceOf[String],
                 employee(1).asInstanceOf[String],
                 employee(2).asInstanceOf[String]))
  }
First I will note that I cannot explain why your explode() turns into Row(employee: Seq[Row]), as I don't know the schema of your DataFrame; I have to assume it has to do with the structure of your data.
Not knowing your original data, I have created a small data set to work from:
scala> val df = sc.parallelize( Array( (1, "dsfds dsf dasf dsf dsf d"), (2, "2344 2353 24 23432 234"))).toDF("id", "text")
df: org.apache.spark.sql.DataFrame = [id: int, text: string]
If I now map over it, you can see that it returns rows containing data of type Any.
scala> df.map {case row: Row => (row(0), row(1)) }
res21: org.apache.spark.rdd.RDD[(Any, Any)] = MapPartitionsRDD[17] at map at <console>:33
You have basically lost the type information, which is why you need to explicitly specify the type when you want to use the data in the row:
scala> df.map {case row: Row => (row(0).asInstanceOf[Int], row(1).asInstanceOf[String]) }
res22: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[18] at map at <console>:33
So, in order to explode it, I have to do the following
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.Row
df.explode(col("id"), col("text")) {case row: Row =>
val id = row(0).asInstanceOf[Int]
val words = row(1).asInstanceOf[String].split(" ")
words.map(word => (id, word))
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.Row
res30: org.apache.spark.sql.DataFrame = [id: int, text: string, _1: int, _2: string]
scala> res30 show
+---+--------------------+---+-----+
| id| text| _1| _2|
+---+--------------------+---+-----+
| 1|dsfds dsf dasf ds...| 1|dsfds|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| dasf|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| dsf|
| 1|dsfds dsf dasf ds...| 1| d|
| 2|2344 2353 24 2343...| 2| 2344|
| 2|2344 2353 24 2343...| 2| 2353|
| 2|2344 2353 24 2343...| 2| 24|
| 2|2344 2353 24 2343...| 2|23432|
| 2|2344 2353 24 2343...| 2| 234|
+---+--------------------+---+-----+
If you want named columns, you can define a case class to hold your exploded data:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.Row
case class ExplodedData(word: String)
df.explode(col("id"), col("text")) {case row: Row =>
val words = row(1).asInstanceOf[String].split(" ")
words.map(word => ExplodedData(word))
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.Row
defined class ExplodedData
res35: org.apache.spark.sql.DataFrame = [id: int, text: string, word: string]
scala> res35.select("id","word").show
+---+-----+
| id| word|
+---+-----+
| 1|dsfds|
| 1| dsf|
| 1| dasf|
| 1| dsf|
| 1| dsf|
| 1| d|
| 2| 2344|
| 2| 2353|
| 2| 24|
| 2|23432|
| 2| 234|
+---+-----+
Hope this brings some clarity.
I think you should read the documentation and run a test first.
explode on a DataFrame still returns a DataFrame, and it accepts a lambda function f: (Row) ⇒ TraversableOnce[A] as a parameter.
In the lambda function you match the input by case. You already know that your input will be a Row of employee, which is still a Seq of Row, so the case for the input will be Row(employee: Seq[Row]); if you don't understand this part, you can read more about the unapply function in Scala.
Then employee (I believe you should use employees here), as a Seq of Row, has the map function applied to map each row to an Employee. You use the Scala apply function to get the i-th value in that row, but the return value is an Object, so you have to use asInstanceOf to convert it to the type you expect.
And than, employee(I believe you should use employees here), as a Seq of Row, will apply the map function to map each row to a Employee. And you will use the scala apply function to get the i'th value in this row. But the return value is an Object , so you have to use asInstanceOf to convert it into the type you expected.