column not present in pyspark dataframe? - apache-spark

I have a PySpark dataframe df with IP addresses as column names, like below:
summary  `0.0.0.0`  8.8.8.8  1.0.0.0  1.1.1.1
count    14         14       14       14
min      123        231      423      54
max      2344       241      555      100
When I call df.columns, it gives me the column list below, but the backquote special character of the first column is missing from the list.
[0.0.0.0, 8.8.8.8 ,1.0.0.0,1.1.1.1]
And when I perform any operation using this list, it gives me the error: column 0.0.0.0 not present in dataframe.
Also, I tried to change the column names using the code below, but nothing changes because the backquote is not actually in the list.
import re
from pyspark.sql import functions as F
df = df.select([F.col(col).alias(re.sub("[`]+", "", col)) for col in df.columns])
How to resolve this issue?
The schema of the df, from df.printSchema(), is below:
root
|-- summary: string (nullable = true)
|-- 0.0.0.0: string (nullable = true)
|-- 8.8.8.8: string (nullable = true)
|-- 1.0.0.0: string (nullable = true)
|-- 1.1.1.1: string (nullable = true)

With a number as the first character of the column name, you can always force-add backticks when querying it:
df.select('summary', '`0.0.0.0`').show()
# +-------+-------+
# |summary|0.0.0.0|
# +-------+-------+
# |  count|     14|
# |    min|    123|
# |    max|   2344|
# +-------+-------+
df.select(['summary'] + [f'`{col}`' for col in df.columns if col != 'summary']).show()
# +-------+-------+-------+-------+-------+
# |summary|0.0.0.0|8.8.8.8|1.0.0.0|1.1.1.1|
# +-------+-------+-------+-------+-------+
# |  count|     14|     14|     14|     14|
# |    min|    123|    231|    423|     54|
# |    max|   2344|    241|    555|    100|
# +-------+-------+-------+-------+-------+
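If you'd rather not sprinkle backticks through every query, another option (my suggestion, not part of the original answer) is to rename the columns once so the dots disappear:
from pyspark.sql import functions as F

# Backtick each existing name once, then alias it to a dot-free version
df_renamed = df.select(
    [F.col(f'`{c}`').alias(c.replace('.', '_')) for c in df.columns]
)
df_renamed.printSchema()
# root
#  |-- summary: string (nullable = true)
#  |-- 0_0_0_0: string (nullable = true)
#  |-- 8_8_8_8: string (nullable = true)
#  |-- 1_0_0_0: string (nullable = true)
#  |-- 1_1_1_1: string (nullable = true)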

Related

Copying column name as dictionary key in all values of column in Pyspark dataframe

I have a pyspark df, distributed across the cluster, as follows:
Name ID
A 1
B 2
C 3
I want to modify the 'ID' column so that every value becomes a Python dictionary, with the column name as the key and the existing column value as the value, as follows:
Name TRACEID
A {ID:1}
B {ID:2}
C {ID:3}
How do I achieve this using pyspark code? I need an efficient solution since it's a large distributed df across the cluster.
Thanks in advance.
You can first construct a struct from the ID column, and then use the to_json function to convert it to the desired format.
import pyspark.sql.functions as F

df = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
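For reference, on the sample data above the struct-to-JSON conversion yields strings like {"ID":1} (a small sketch of my own, not part of the original answer, assuming a SparkSession named spark):
import pyspark.sql.functions as F

df = spark.createDataFrame([('A', 1), ('B', 2), ('C', 3)], ['Name', 'ID'])
df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID')).show()
# +----+--------+
# |Name| TRACEID|
# +----+--------+
# |   A|{"ID":1}|
# |   B|{"ID":2}|
# |   C|{"ID":3}|
# +----+--------+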
You can use the create_map function
from pyspark.sql.functions import col, lit, create_map
sparkDF.withColumn("ID_dict", create_map(lit("id"),col("ID"))).show()
# +----+---+---------+
# |Name| ID|  ID_dict|
# +----+---+---------+
# |   A|  1|{id -> 1}|
# |   B|  2|{id -> 2}|
# |   C|  3|{id -> 3}|
# +----+---+---------+
Rename/drop columns:
df = sparkDF.withColumn("ID_dict", create_map(lit("id"), col("ID"))).drop(col("ID")).withColumnRenamed("ID_dict", "ID")
df.show()
# +----+---------+
# |Name|       ID|
# +----+---------+
# |   A|{id -> 1}|
# |   B|{id -> 2}|
# |   C|{id -> 3}|
# +----+---------+
df.printSchema()
# root
# |-- Name: string (nullable = true)
# |-- ID: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
You get a column with map datatype that's well suited for representing a dictionary.
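A follow-up worth noting (based on how PySpark converts types, not something the original answer states): when you collect a MapType column back to the driver, each value comes back as an ordinary Python dict, which is what the question asked for.
rows = df.collect()
print(rows[0]['ID'])        # {'id': 1}
print(type(rows[0]['ID']))  # <class 'dict'>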

pyspark split array type column to multiple columns

After running the ALS algorithm in pyspark over a dataset, I have ended up with a final dataframe that looks like the following.
The recommendation column is array type; now I want to split this column so that my final dataframe looks like this.
Can anyone suggest which pyspark function can be used to form this dataframe?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f

df.withColumn('recommendation', f.explode('recommendation'))\
  .withColumn('ID', f.col('recommendation').getItem('ID'))\
  .withColumn('rating', f.col('recommendation').getItem('rating'))\
  .groupby('person')\
  .pivot('ID')\
  .agg(f.first('rating')).show()
+------+---+---+---+
|person|  a|  b|  c|
+------+---+---+---+
|   xyz|0.4|0.3|0.3|
|   abc|0.5|0.3|0.2|
|   def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with RDD:
from pyspark.sql import Row

df.rdd.map(lambda r: Row(
    person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person|                  a|                  b|                  c|
+------+-------------------+-------------------+-------------------+
|   abc|                0.5|0.30000001192092896|0.20000000298023224|
|   def|0.30000001192092896|0.20000000298023224|                0.5|
|   xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
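The long decimals in the RDD output come from the float rating type. If they bother you, a small variation of the first approach (my addition, assuming the same df as above) rounds the ratings before pivoting:
import pyspark.sql.functions as f

# Same explode + pivot as above, but round the float ratings for display
(df.withColumn('recommendation', f.explode('recommendation'))
   .withColumn('ID', f.col('recommendation').getItem('ID'))
   .withColumn('rating', f.round(f.col('recommendation').getItem('rating'), 2))
   .groupby('person')
   .pivot('ID')
   .agg(f.first('rating'))
   .show())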

Adding values of two column which datatypes are in string format in pyspark

The log files are in json format; I extracted the data into a pyspark dataframe.
There are two columns whose values are integers, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I have used is like:
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me understand how I can get the correct output.
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
|  45|  10|
|  10|  20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
|  45|  10|  55.0|
|  10|  20|  30.0|
+----+----+------+
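If you would rather get integer results than 55.0, you can cast the columns explicitly before adding (my addition, assuming the same df):
>>> from pyspark.sql.functions import col
>>> df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
|  45|  10|    55|
|  10|  20|    30|
+----+----+------+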

pyspark nested columns in a string

I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out by each element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema such as the following and creating a new df by applying that schema with from_json:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])
user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values, e.g.
user_df.select('test.user_id').show()
returns:
+-------+
|user_id|
+-------+
|   null|
|   null|
+-------+
Maybe I shouldn't be using from_json, since the users string is not pure JSON. Any suggestions on an approach I could take?
The schema should conform to the shape of the data. Unfortunately from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf

df = spark.createDataFrame([
    (1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
    (2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
    ("id", "date", "users")
)

@udf("map<string, array<string>>")
def parse(s):
    try:
        return json.loads(s)
    except:
        pass

(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())
# +---+----------+-------+------------+
# | id|      date|user_id|user_product|
# +---+----------+-------+------------+
# |  1|2017-12-03|      1|         xxx|
# |  1|2017-12-03|      1|         yyy|
# |  1|2017-12-03|      1|         zzz|
# |  1|2017-12-03|      2|         aaa|
# |  1|2017-12-03|      2|         bbb|
# |  2|2017-12-04|      1|         uuu|
# |  2|2017-12-04|      1|         yyy|
# |  2|2017-12-04|      1|         zzz|
# |  2|2017-12-04|      2|         aaa|
# +---+----------+-------+------------+
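A side note that is not part of the original answer: in more recent Spark releases (I believe 2.2 and later, but treat the exact version as an assumption), from_json also accepts a MapType schema, so the Python UDF can be avoided:
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import MapType, StringType, ArrayType

# Parse the JSON string straight into a map<string, array<string>> column
parsed = df.withColumn(
    "users", from_json("users", MapType(StringType(), ArrayType(StringType()))))

(parsed
    .select("id", "date", explode("users").alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())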
You don't need to use from_json. You have to explode twice: once for user_id and once for users.
import pyspark.sql.functions as F
df = sql.createDataFrame([
    (1, '2017-12-03', {"1": ["xxx", "yyy", "zzz"], "2": ["aaa", "bbb"], "3": []}),
    (2, '2017-12-04', {"1": ["uuu", "yyy", "zzz"], "2": ["aaa"], "3": []})],
    ['id', 'date', 'users']
)
df = df.select('id', 'date', F.explode('users').alias('user_id', 'users'))\
       .select('id', 'date', 'user_id', F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id|      date|user_id|users|
+---+----------+-------+-----+
|  1|2017-12-03|      1|  xxx|
|  1|2017-12-03|      1|  yyy|
|  1|2017-12-03|      1|  zzz|
|  1|2017-12-03|      2|  aaa|
|  1|2017-12-03|      2|  bbb|
|  2|2017-12-04|      1|  uuu|
|  2|2017-12-04|      1|  yyy|
|  2|2017-12-04|      1|  zzz|
|  2|2017-12-04|      2|  aaa|
+---+----------+-------+-----+

Spark DataFrame making column null value to empty

I have joined two data frames with a left outer join. The resulting data frame has null values. How do I make them empty instead of null?
+---+--------+
| id|quantity|
+---+--------+
|  1|    null|
|  2|    null|
|  3|    0.04|
+---+--------+
And here is the schema
root
|-- id: integer (nullable = false)
|-- quantity: double (nullable = true)
Expected output:
+---+--------+
| id|quantity|
+---+--------+
|  1|        |
|  2|        |
|  3|    0.04|
+---+--------+
You cannot make them "empty", since they are double values and an empty string "" is a String. The best you can do is leave them as nulls or set them to 0 using the fill function:
val df2 = df.na.fill(0.0, Seq("quantity"))
Otherwise, if you really want to have empty quantities, you should consider changing quantity column type to String.
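For that last suggestion, here is a minimal PySpark sketch of the cast-to-string route (my illustration; the original answer is in Scala): cast quantity to string and replace the resulting nulls with empty strings.
from pyspark.sql import functions as F

# Cast the double column to string, then turn nulls into empty strings
df2 = df.withColumn(
    "quantity",
    F.coalesce(F.col("quantity").cast("string"), F.lit(""))
)
df2.show()
# +---+--------+
# | id|quantity|
# +---+--------+
# |  1|        |
# |  2|        |
# |  3|    0.04|
# +---+--------+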
