I have a sample JSON data file as shown below:
{"data_id":"1234","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":true,"familyname":true,"swimming_pool":true}}}
{"data_id":"6789","risk_characteristics":{"indicators":{"alcohol":true,"house":true,"business":false,"familyname":true}}}
{"data_id":"5678","risk_characteristics":{"indicators":{"alcohol":false}}}
I converted the JSON file to Parquet and loaded it into Hive using the code below:
dataDF = spark.read.json("path/Datasmall.json")
dataDF.write.parquet("data.parquet")
parqFile = spark.read.parquet("data.parquet")
parqFile.write.saveAsTable("indicators_table", format='parquet', mode='append', path='/externalpath/indicators_table/')
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
fromHiveDF = hive_context.table("default.indicators_table")
fromHiveDF.show()
indicatorsDF = fromHiveDF.select('data_id', 'risk_characteristics.indicators')
indicatorsDF.printSchema()
root
|-- data_id: string (nullable = true)
|-- indicators: struct (nullable = true)
| |-- alcohol: boolean (nullable = true)
| |-- house: boolean (nullable = true)
| |-- business: boolean (nullable = true)
| |-- familyname: boolean (nullable = true)
| |-- swimming_pool: boolean (nullable = true)
indicatorsDF.show()
+-------+--------------------+
|data_id| indicators|
+-------+--------------------+
| 1234|[true, true, true...|
| 6789|[true, false, tru...|
| 5678| [false,,,,]|
+-------+--------------------+
Instead of retrieving the data as select data_id, indicators.alcohol, indicators.house, etc.,
I simply want to get a parquet data file with only the 3 columns below. That is, the struct fields should be converted to rows under an indicators_type column name.
data_id indicators_type indicators_value
1234 alcohol T
1234 house T
1234 business T
1234 familyname T
1234 swimming_pool T
6789 alcohol T
6789 house T
6789 business F
6789 familyname T
5678 alcohol F
May I ask how to do that? I am trying to accomplish this using PySpark. Also, is there a way to achieve this without hardcoding the field names? In my actual data the struct fields can extend beyond familyname, and there could even be 100 of them.
Thanks a lot
Use stack to unpivot the struct fields into rows:
df.show()
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|1234 |[true, true, false, true] |
|6789 |[true, false, true, false]|
+-------+--------------------------+
# Build the stack expression dynamically from the struct's field names
struct_cols = df.select('indicators.*').columns
stack_expr = 'stack({}, {}) as (indicators_type, indicators_value)'.format(
    len(struct_cols),
    ', '.join("'{0}', indicators.{0}".format(c) for c in struct_cols)
)
df2 = df.selectExpr(
'data_id',
stack_expr
)
df2.show()
+-------+---------------+----------------+
|data_id|indicators_type|indicators_value|
+-------+---------------+----------------+
| 1234| alcohol| true|
| 1234| house| true|
| 1234| business| false|
| 1234| familyname| true|
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
+-------+---------------+----------------+
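Since the goal is a parquet file containing just these 3 columns, you can then write df2 back out; a minimal sketch (the output path below is a placeholder, not from the original post):
# write the unpivoted result as a parquet file (path is a placeholder)
df2.write.mode('overwrite').parquet('/outputpath/indicators_long/')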
Another solution using explode:
val df = spark.sql(""" with t1 as (
select 1234 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
union
select 6789 data_id, named_struct('alcohol',true, 'house',false, 'business', true, 'familyname', false) as indicators
)
select * from t1
""")
df.show(false)
df.printSchema
+-------+--------------------------+
|data_id|indicators |
+-------+--------------------------+
|6789 |[true, false, true, false]|
|1234 |[true, false, true, false]|
+-------+--------------------------+
root
|-- data_id: integer (nullable = false)
|-- indicators: struct (nullable = false)
| |-- alcohol: boolean (nullable = false)
| |-- house: boolean (nullable = false)
| |-- business: boolean (nullable = false)
| |-- familyname: boolean (nullable = false)
val df2 = df.withColumn("x", explode(array(
map(lit("alcohol") ,col("indicators.alcohol")),
map(lit("house"), col("indicators.house")),
map(lit("business"), col("indicators.business")),
map(lit("familyname"), col("indicators.familyname"))
)))
df2.select(col("data_id"),map_keys(col("x"))(0), map_values(col("x"))(0)).show
+-------+--------------+----------------+
|data_id|map_keys(x)[0]|map_values(x)[0]|
+-------+--------------+----------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+--------------+----------------+
Update-1:
To get the indicators struct columns dynamically, use the below approach.
val colsx = df.select("indicators.*").columns
colsx: Array[String] = Array(alcohol, house, business, familyname)
val exp1 = colsx.map( x => s""" map("${x}", indicators.${x}) """ ).mkString(",")
val exp2 = " explode(array( " + exp1 + " )) "
val df2 = df.withColumn("x",expr(exp2))
df2.select(col("data_id"),map_keys(col("x"))(0).as("indicator_key"), map_values(col("x"))(0).as("indicator_value")).show
+-------+-------------+---------------+
|data_id|indicator_key|indicator_value|
+-------+-------------+---------------+
| 6789| alcohol| true|
| 6789| house| false|
| 6789| business| true|
| 6789| familyname| false|
| 1234| alcohol| true|
| 1234| house| false|
| 1234| business| true|
| 1234| familyname| false|
+-------+-------------+---------------+
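Since the question is in PySpark, here is a rough Python translation of the dynamic explode approach above; a sketch, assuming df has the same data_id column and indicators struct:
from pyspark.sql import functions as F

# build one single-entry map per struct field, then explode into one row per field
cols = df.select("indicators.*").columns
df2 = df.withColumn(
    "x",
    F.explode(F.array(*[F.create_map(F.lit(c), F.col("indicators." + c)) for c in cols]))
)
df2.select(
    "data_id",
    F.map_keys("x")[0].alias("indicators_type"),
    F.map_values("x")[0].alias("indicators_value")
).show()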
I have a PySpark dataframe
simpleData = [("person0",10, 10), \
("person1",1, 1), \
("person2",1, 0), \
("person3",5, 1), \
]
columns= ["persons_name","A", 'B']
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
It looks like
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want a threshold of 2 to be applied to the values of columns A and B, such that any value less than the threshold becomes 0 and any value greater than the threshold becomes 1.
The final result should look something like-
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
from pyspark.sql import functions as F

threshold = 2
exp.select(
    ['persons_name'] + [(F.col(c) > F.lit(threshold)).cast('int').alias(c) for c in ['A', 'B']]
).show()
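If you prefer an explicit conditional, the same thresholding can be written with when/otherwise; a minimal sketch on the same exp dataframe:
from pyspark.sql import functions as F

threshold = 2
exp.select(
    ['persons_name'] + [F.when(F.col(c) > threshold, 1).otherwise(0).alias(c) for c in ['A', 'B']]
).show()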
The log files are in JSON format, and I extracted the data into a PySpark dataframe.
There are two columns whose values are integers, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I have used is like:
from pyspark.sql import functions as F
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me understand how I can get the correct output.
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+
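If you want integer results instead of the doubles shown above, you can cast the string columns explicitly before adding; a minimal sketch:
>>> df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()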
In PySpark 2.2, I am essentially trying to add rows by user.
If I have my main Dataframe that looks like:
import pandas as pd

main_list = [["a","bb",5], ["d","cc",10],["d","bb",11]]
main_pd = pd.DataFrame(main_list, columns = ['user',"group", 'value'])
main_df = spark.createDataFrame(main_pd)
main_df.show()
+----+-----+-----+
|user|group|value|
+----+-----+-----+
| a| bb| 5|
| d| cc| 10|
| d| bb| 11|
+----+-----+-----+
I then have a key Dataframe, and I would like every user to have a row for every group value.
User d has rows for groups bb and cc; I would like user a to have the same.
key_list = [["bb",10],["cc",17]]
key_pd = pd.DataFrame(key_list, columns = ['group', 'value'])
key_df = spark.createDataFrame(key_pd)
main_df.join(key_df, ["group"], how ="outer").show()
But my result returns:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| bb| d| 11| 10|
+-----+----+-----+-----+
Here are the schemas of each Dataframe:
main_df.printSchema()
root
|-- user: string (nullable = true)
|-- group: string (nullable = true)
|-- value: long (nullable = true)
key_df.printSchema()
root
|-- group: string (nullable = true)
|-- value: long (nullable = true)
Essentially I would like the result to be:
+-----+----+-----+-----+
|group|user|value|value|
+-----+----+-----+-----+
| cc| d| 10| 17|
| bb| a| 5| 10|
| cc| a| Null| 17|
| bb| d| 11| 10|
+-----+----+-----+-----+
I don't think a full outer join with a coalesce will accomplish this, so I have also experimented with row_number/rank.
Get all the user-group combinations with a cross join, then left join that with main_df to generate the missing rows, and finally left join the result with key_df.
users = main_df.select("user").distinct()
groups = main_df.select("group").distinct()
user_group = users.crossJoin(groups)
all_combs = user_group.join(
    main_df,
    (main_df.user == user_group.user) & (main_df.group == user_group.group),
    "left"
).select(user_group.user, user_group.group, main_df.value)
all_combs.join(key_df, key_df.group == all_combs.group, "left").show()
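If you want a single group column and distinct value names in the final result, you could also rename key_df's value column before the join; a minimal sketch (key_value is just an illustrative name):
result = all_combs.join(
    key_df.withColumnRenamed("value", "key_value"),
    "group",
    "left"
)
result.show()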
I have a dataframe like this
data = [("ID1", {'A': 1, 'B': 2})]
df = spark.createDataFrame(data, ["ID", "Coll"])
df.show()
+---+----------------+
| ID| Coll|
+---+----------------+
|ID1|[A -> 1, B -> 2]|
+---+----------------+
df.printSchema()
root
|-- ID: string (nullable = true)
|-- Coll: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
I want to explode the 'Coll' column such that
+---+---+-----+
| ID|Key|Value|
+---+---+-----+
|ID1|  A|    1|
|ID1|  B|    2|
+---+---+-----+
I am trying to do this in PySpark.
I am successful if I use only one column; however, I want the ID column as well:
df.select(explode("Coll").alias("x", "y")).show()
+---+---+
| x| y|
+---+---+
| A| 1|
| B| 2|
+---+---+
Simply add the ID column to the select and it should work:
df.select("id", explode("Coll").alias("x", "y"))