sanitize column values in pyspark dataframe - apache-spark

Given a CSV file, I converted it to a DataFrame using code like the following.
raw_df = spark.read.csv(input_data, header=True)
That creates a DataFrame that looks something like this:
| Name |
========
| 23 |
| hi2 |
| me3 |
| do |
I want to convert this column so that it only contains numbers. The final result should look like this, with the letters stripped from hi2 and me3:
| Name |
========
| 23 |
| 2 |
| 3 |
| do |
I want to sanitize the values and make sure they only contain numbers, but I'm not sure if that's possible in Spark.

Yes, it's possible. You can use regexp_replace from pyspark.sql.functions.
Please check this:
import pyspark.sql.functions as f
df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])
df.show()
+----+
|name|
+----+
| 12|
| hi2|
| me3|
+----+
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))
final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
| 12| 12|
| hi2| 2|
| me3| 3|
+----+--------+
final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
| 12| 12| 2|
| hi2| 2| 1|
| me3| 3| 1|
+----+--------+---+
You can adjust the regex as needed.
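Tying this back to the raw_df from the question, a minimal sketch (assuming the column is named Name, as in the sample data) would be:
import pyspark.sql.functions as f

# strip alphabetic characters from the Name column of the original DataFrame
sanitized_df = raw_df.withColumn('Name', f.regexp_replace('Name', '[a-zA-Z]', ''))
sanitized_df.show()
Note that this strips letters from every value, so a purely alphabetic value such as do becomes an empty string; adjust the regex if you want to leave such rows untouched.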

Here is another way of doing the same thing. It's just an alternative; it's better to use Spark's built-in functions when available, as shown above.
from pyspark.sql.functions import udf
import re
# note: assumes every value contains at least one digit; otherwise [0] raises an IndexError
user_func = udf(lambda x: re.findall(r"\d+", x)[0])
newdf = df.withColumn('new_column', user_func(df.Name))
>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
| 23| 23|
| hi2| 2|
| me3| 3|
+----+----------+
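If you want values with no digits at all (like do in the question) to pass through unchanged, a minimal sketch of a safer variant of the UDF above could be:
from pyspark.sql.functions import udf
import re

# fall back to the original value when no digits are found
safe_extract = udf(lambda x: (re.findall(r"\d+", x) or [x])[0])
newdf = df.withColumn('new_column', safe_extract(df.Name))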

Related

How to expand months in pyspark

I have data as below:
+-----+---------+----------+
| TYPE|DTIN_MNTH|DTOUT_MNTH|
+-----+---------+----------+
| A| 2022-03| 2022-05|
| B| 2022-04| 2022-04|
| C| 2022-05| 2022-07|
+-----+---------+----------+
I want to expand the range between DTIN_MNTH and DTOUT_MNTH into rows, as below:
+-----+---------+
| TYPE| MNTH|
+-----+---------+
| A| 2022-03|
| A| 2022-04|
| A| 2022-05|
| B| 2022-04|
| C| 2022-05|
| C| 2022-06|
| C| 2022-07|
+-----+---------+
From Spark 2.4, you can use the sequence built-in function to generate an array of months between two dates, then use explode to expand that array into rows. You can then reformat the expanded column back to month format.
That gives you the following code, where input is your input DataFrame:
from pyspark.sql import functions as F

result = input.select(
    F.col('TYPE'),
    F.explode(
        F.sequence(
            F.to_timestamp(F.col('DTIN_MNTH'), 'yyyy-MM'),
            F.to_timestamp(F.col('DTOUT_MNTH'), 'yyyy-MM'),
            F.expr('INTERVAL 1 MONTH'))
    ).alias('MNTH')
).withColumn('MNTH', F.date_format(F.col('MNTH'), 'yyyy-MM'))
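For reference, here is a minimal sketch of building the sample input from the question and running the snippet above (it assumes an existing SparkSession named spark; column names are taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input = spark.createDataFrame(
    [('A', '2022-03', '2022-05'),
     ('B', '2022-04', '2022-04'),
     ('C', '2022-05', '2022-07')],
    ['TYPE', 'DTIN_MNTH', 'DTOUT_MNTH'])

# after running the select/explode code above:
result.show()  # A expands to 2022-03..2022-05, B stays 2022-04, C expands to 2022-05..2022-07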

Equality Filter in Spark Structured Streaming [duplicate]

I want to select rows where a column equals a certain value. I am doing this in Scala and having a little trouble.
Here's my code:
df.select(df("state")==="TX").show()
This returns the state column as boolean values instead of just the rows where state is TX.
I've also tried
df.select(df("state")=="TX").show()
but this doesn't work either.
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
There is another simple, SQL-like option. With Spark 1.6, the following should also work:
df.filter("state = 'TX'")
This is a newer way of specifying SQL-like filters. For a full list of supported operators, check out this class.
You should be using where; select is a projection that returns the output of the expression, which is why you get boolean values. where is a filter that keeps the structure of the DataFrame but only keeps the rows where the filter condition holds.
Along the same lines, per the documentation, you can write this in three different ways:
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
To get the negation, do this ...
df.filter(not( ..expression.. ))
e.g.
df.filter(not($"state" === "TX"))
df.filter($"state" like "T%%") for pattern matching
df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality
This worked on Spark 2.x:
import sqlContext.implicits._
df.filter($"state" === "TX")
If it needs to be compared against a variable (e.g., a var):
import sqlContext.implicits._
df.filter($"state" === var)
Note: import sqlContext.implicits._ is required for the $ column syntax.
We can chain multiple filter/where conditions on a DataFrame.
For example:
table1_df
  .filter($"Col_1_name" === "buddy")              // check for equality with a string
  .filter($"Col_2_name" === "A")
  .filter(not($"Col_2_name".contains(" .sql")))   // filter out a string which is not relevant
  .filter("Col_2_name is not null")               // null filter
  .take(5).foreach(println)
Here is a complete example using Spark 2.2+, taking the data in as JSON:
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
//not expression
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
  ("Rockets", 2, "TX"),
  ("Warriors", 6, "CA"),
  ("Spurs", 5, "TX"),
  ("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
The === operator is defined in the Column class. Column does not override Scala's == operator, so df("state")=="TX" is evaluated as a plain Boolean comparison rather than a Column expression, which is why this code errors out.
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument with an Any type, so this isn't the only solution that works. This works too for example:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It's worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when methods are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!
In Spark 2.4:
To compare with one value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare with a collection of values:
df.filter($"col_name".isInCollection(Seq("value1", "value2"))).show()

Aggregation and feature extraction at the same time using pyspark

I have this dataset:
+---------+------+------------------+--------------------+-------------+
| LCLid|season| sum(KWH/hh)| avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703| 4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484| 2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765| 0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632| 2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384| 1|
For every consumer ID (LCLid) we have the sum and average consumption in every season; Acorn_grouped is fixed for each consumer.
I want to aggregate by the ID and at the same time extract those new features, with rounded numbers, to finally get this data:
+---------+-------------+-------------------+------------------+------------------+------------------
| LCLid|Acorn_grouped|autumn_avg(KWH/hh) |autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)
+---------+-------------+-------------------+------------------+------------------+-----------------
|MAC000023| 4| | | |
|MAC000128| 2| | | |
|MAC000012| 0| | | |
|MAC000053| 2| | | |
|MAC000121| 1| | | |
You can do a pivot:
import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
        F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
    ).fillna(0)  # replace nulls with zero - you can skip this if you want to keep nulls
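With the aliases above, the pivoted columns come out named {season}_{alias}, which matches the headers in the question for the seasons present in the data. A quick way to check, as a sketch:
# expected something like:
# ['LCLid', 'Acorn_grouped', 'autumn_sum(KWH/hh)', 'autumn_avg(KWH/hh)',
#  'spring_sum(KWH/hh)', 'spring_avg(KWH/hh)', 'summer_sum(KWH/hh)', 'summer_avg(KWH/hh)']
print(result.columns)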

how to combine rows in a data frame by id

I have a data frame:
+---------+---------------------+
| id| Name|
+---------+---------------------+
| 1| 'Gary'|
| 1| 'Danny'|
| 2| 'Christopher'|
| 2| 'Kevin'|
+---------+---------------------+
I need to combine all the Name values for each id. Please tell me how to get from it to this:
+---------+------------------------+
| id| Name|
+---------+------------------------+
| 1| ['Gary', 'Danny']|
| 2| ['Kevin','Christopher']|
+---------+------------------------+
You can use groupBy with the collect functions. Based on your need, you can use collect_list or collect_set.
df.groupBy(col("id")).agg(collect_list(col("Name")))
in case you want to keep duplicate values, or
df.groupBy(col("id")).agg(collect_set(col("Name")))
if you want only unique values.
Use groupBy and collect_list functions for this case.
from pyspark.sql.functions import *
df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10,False)
#+---+------------------------+
#|id |Name |
#+---+------------------------+
#|1 |['Gary', 'Danny'] |
#|2 |['Kevin', 'Christopher']|
#+---+------------------------+
df.groupby('id')['Name'].apply(list)
(Note: this is pandas syntax, not PySpark.)

PySpark - Convert String to Array

I have a column in my DataFrame that is a string with values like ["value_a", "value_b"].
What is the best way to convert this column to an array and explode it? For now, I'm doing something like:
explode(split(col("value"), ",")).alias("value")
But I'm getting strings like ["awesome", "John", and "Maria]" (with stray brackets and quotes), while the expected output should be awesome, John, Maria (one item per line, which is why I'm using explode).
Sample code to reproduce:
from pyspark.sql.functions import col, explode, split

sample_input = [
    {"id": 1, "value": "[\"johnny\", \"maria\"]"},
    {"id": 2, "value": "[\"awesome\", \"George\"]"}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(split(col("value"), ",")).alias("value")).show(n=10)
Output generated by code above:
+---+----------+
| id| value|
+---+----------+
| 1| ["johnny"|
| 1| "maria"]|
| 2|["awesome"|
| 2| "George"]|
+---+----------+
Expected should be:
+---+----------+
| id| value|
+---+----------+
| 1| johnny |
| 1| maria |
| 2| awesome|
| 2| George|
+---+----------+
This worked for me, defining value as an actual array instead of a JSON string:
from pyspark.sql.functions import col, explode

sample_input = [
    {"id": 1, "value": ["johnny", "maria"]},
    {"id": 2, "value": ["awesome", "George"]}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(col("value")).alias("value")).show(n=10)
+---+-------+
| id| value|
+---+-------+
| 1| johnny|
| 1| maria|
| 2|awesome|
| 2| George|
+---+-------+
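If the value column really is a string like "[\"johnny\", \"maria\"]" (as in the question's sample_input) rather than an actual array, a sketch of one alternative (Spark 2.4+, assuming the strings are valid JSON arrays) is to parse it with from_json before exploding:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, StringType

sample_input = [
    {"id": 1, "value": "[\"johnny\", \"maria\"]"},
    {"id": 2, "value": "[\"awesome\", \"George\"]"}
]
df = spark.createDataFrame(sample_input)

# parse the JSON-array string into a real array<string>, then explode it
parsed = df.withColumn("value", from_json(col("value"), ArrayType(StringType())))
parsed.select(col("id"), explode(col("value")).alias("value")).show()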
