Spark: how to remove unnecessary characters in df column values - apache-spark

I have a df like this:
+----+---+
| _c0|_c1|
+----+---+
|('a'| 2)|
|('b'| 4)|
|('c'| 6)|
+----+---+
I want it like below; how do I do that?
+----+---+
| _c0|_c1|
+----+---+
| a | 2 |
| b | 4 |
| c | 6 |
+----+---+
If I try like this, I get an error:
df1.select(regexp_replace('_c0', "('", "c")).show()
An error occurred while calling o789.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 71.0 failed 1 times, most recent failure: Lost task
1.0 in stage 71.0 (TID 184, localhost, executor driver): java.util.regex.PatternSyntaxException: Unclosed group near index 2

Like the other user has said, it is necessary to escape special characters like brackets with a backslash. Here you can find a list of regex special characters. The following code uses two different approaches to your problem. With regexp_extract we extract the single character between (' and ' in column _c0. With regexp_replace we replace ) in the second column. You can of course use only the regexp_replace function with the regex "[()']" to achieve what you wanted; I just want to show you two different ways to tackle the problem.
from pyspark.sql import functions as F
columns = ['_c0', '_c1']
vals = [("('a'", "2)"),("('b'", "4)"),("('c'", "6)")]
df = spark.createDataFrame(vals, columns)
df = df.select(F.regexp_extract('_c0', r"\('(\w)'", 1).alias('_c0'),
               F.regexp_replace('_c1', r"\)", "").alias('_c1'))
df.show()
Output:
+---+---+
|_c0|_c1|
+---+---+
| a| 2|
| b| 4|
| c| 6|
+---+---+
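For completeness, the single-regexp_replace variant mentioned above could look roughly like this (a sketch reusing df and the F import from the snippet above; the df_clean name is just illustrative):
df_clean = df.select(F.regexp_replace('_c0', r"[()']", "").alias('_c0'),
                     F.regexp_replace('_c1', r"[()']", "").alias('_c1'))
df_clean.show()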

You should escape the brackets:
df1.select(regexp_replace('_c0', "\\('", "c")).show()

Related

PySpark: Create column with when and contains/isin

I'm using pyspark on a 2.X Spark version for this.
I have 2 sql dataframes, df1 and df2. df1 is a union of multiple small dfs with the same header names.
df1 = (
    df1_1.union(df1_2)
         .union(df1_3)
         .union(df1_4)
         .union(df1_5)
         .union(df1_6)
         .union(df1_7)
         .distinct()
)
df2 does not have the same header names.
What I'm trying to achieve is to create a new column and fill it with 2 values depending on a condition. The condition is roughly: if the column of df1 contains an element of a column of df2, then write A, else B.
So I tried something like this:
df1 = df1.withColumn(
    "new_col",
    when(df1["ColA"].substr(0, 4).contains(df2["ColA_a"]), "A").otherwise("B"),
)
All fields are string type.
I also tried using isin, but the error is the same.
Note: substr(0, 4) is there because in df1["ColA"] I only need the first 4 characters of my field to match df2["ColA_a"].
py4j.protocol.Py4JJavaError: An error occurred while calling o660.select. :
org.apache.spark.sql.AnalysisException: Resolved attribute(s) ColA_a#444 missing from
ColA#438,ColB#439 in operator !Project [Contains(ColA#438, ColA_a#444) AS contains(ColA, ColA_a)#451].;;
The solutions I've read on the Internet and tried:
Cloning the dfs
Collecting the df and creating a new df (here we lose the performance of Spark, and that's very sad)
Renaming columns to have the same name, or a different name (ambiguous naming?)
EDIT:
Here is some input/output as requested.
df1
+-----+-----+-----+
| Col1| ColA| ColB|
+-----+-----+-----+
|value|3062x|value|
|value|2156x|value|
|value|3059x|value|
|value|3044x|value|
|value|2661x|value|
|value|2400x|value|
|value|1907x|value|
|value|4384x|value|
|value|4427x|value|
|value|2091x|value|
+-----+-----+-----+
df2
+------+------+
|ColA_a|ColB_b|
+------+------+
| 2156| GMVT7|
| 2156| JQL71|
| 2156| JZDSQ|
| 2050| GX8PH|
| 2050| G67CV|
| 2050| JFFF7|
| 2031| GCT5C|
| 2170| JN0LB|
| 2129| J2PRG|
| 2091| G87WT|
+------+------+
output
+-----+-----+-----+-------+
| Col1| ColA| ColB|new_col|
+-----+-----+-----+-------+
|value|3062x|value| B |
|value|2156x|value| A |
|value|3059x|value| B |
|value|3044x|value| B |
|value|2661x|value| B |
|value|2400x|value| B |
|value|1907x|value| B |
|value|4384x|value| B |
|value|4427x|value| B |
|value|2091x|value| A |
+-----+-----+-----+-------+
You can use an rlike join to determine whether the value exists in the other column:
from pyspark.sql import functions as F

df1 = sqlContext.createDataFrame([
    ('value', 3062, 'value'),
    ('value', 2156, 'value'),
    ('value', 3059, 'value'),
    ('value', 3044, 'value'),
    ('value', 2661, 'value'),
    ('value', 2400, 'value'),
    ('value', 1907, 'value'),
    ('value', 4384, 'value'),
    ('value', 4427, 'value'),
    ('value', 2091, 'value')
], schema=['Col1', 'ColA', 'ColB'])

df2 = sqlContext.createDataFrame([
    (2156, 'GMVT7'),
    (2156, 'JQL71'),
    (2156, 'JZDSQ'),
    (2050, 'GX8PH'),
    (2050, 'G67CV'),
    (2050, 'JFFF7'),
    (2031, 'GCT5C'),
    (2170, 'JN0LB'),
    (2129, 'J2PRG'),
    (2091, 'G87WT')
], schema=['ColA_a', 'ColB_b'])

# left join on a regex match: a row matches if ColA contains ColA_a
df_join = df1.join(df2.select('ColA_a').distinct(),
                   F.expr("ColA rlike ColA_a"), how='left')
df_fin = df_join.withColumn("new_col",
                            F.when(F.col('ColA_a').isNull(), 'B').otherwise('A'))
df_fin.show()
+-----+----+-----+------+-------+
| Col1|ColA| ColB|ColA_a|new_col|
+-----+----+-----+------+-------+
|value|3062|value| null| B|
|value|2156|value| 2156| A|
|value|3059|value| null| B|
|value|3044|value| null| B|
|value|2661|value| null| B|
|value|2400|value| null| B|
|value|1907|value| null| B|
|value|4384|value| null| B|
|value|4427|value| null| B|
|value|2091|value| 2091| A|
+-----+----+-----+------+-------+
If you prefer not to use an rlike join, you can use the isin() method in your join.
df_join = df1.join(df2.select('ColA_a').distinct(),F.col('ColA').isin(F.col('ColA_a')),how = 'left')
df_fin = df_join.withColumn("new_col",F.when(F.col('ColA_a').isNull(),'B').otherwise('A'))
The results will be the same.
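If you also want the output to match the requested shape exactly, you can drop the helper column afterwards, e.g.:
df_fin = df_fin.drop('ColA_a')
df_fin.show()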

How to combine and sort different dataframes into one?

Given two dataframes, which may have completely different schemas except for an index column (timestamp in this case), such as df1 and df2 below:
df1:
timestamp | length | width
1 | 10 | 20
3 | 5 | 3
df2:
timestamp | name | length
0 | "sample" | 3
2 | "test" | 6
How can I combine these two dataframes into one that would look something like this:
df3:
timestamp | df1 | df2
| length | width | name | length
0 | null | null | "sample" | 3
1 | 10 | 20 | null | null
2 | null | null | "test" | 6
3 | 5 | 3 | null | null
I am extremely new to spark, so this might not actually make a lot of sense. But the problem I am trying to solve is: I need to combine these dataframes so that later I can convert each row to a given object. However, they have to be ordered by timestamp, so when I write these objects out, they are in the correct order.
So for example, given the df3 above, I would be able to generate the following list of objects:
objs = [
ObjectType1(timestamp=0, name="sample", length=3),
ObjectType2(timestamp=1, length=10, width=20),
ObjectType1(timestamp=2, name="test", length=6),
ObjectType2(timestamp=3, length=5, width=3)
]
Perhaps combining the dataframes does not make sense, but how could I sort the dataframes individually and somehow grab the Rows from each one of them ordered by timestamp globally?
P.S.: Note that I repeated length in both dataframes. That was done on purpose to illustrate that they may have columns of same name and type, but represent completely different data, so merging schema is not a possibility.
What you need is a full outer join, possibly renaming one of the columns, something like df1.join(df2.withColumnRenamed("length","length2"), Seq("timestamp"), "full_outer").
See this example, built from yours (just less typing):
// data shaped as your example
case class t1(ts: Int, width: Int, l: Int)
case class t2(ts: Int, name: String, l: Int)
// create data frames
val df1 = Seq(t1(1, 10, 20), t1(3, 5, 3)).toDF
val df2 = Seq(t2(0, "sample", 3), t2(2, "test", 6)).toDF
df1.join(df2.withColumnRenamed("l", "l2"), Seq("ts"), "full_outer").sort("ts").show
+---+-----+----+------+----+
| ts|width| l| name| l2|
+---+-----+----+------+----+
| 0| null|null|sample| 3|
| 1| 10| 20| null|null|
| 2| null|null| test| 6|
| 3| 5| 3| null|null|
+---+-----+----+------+----+
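If you're working in PySpark rather than Scala, a minimal equivalent sketch (assuming the original column names from the question, with the clashing length column renamed) would be:
df3 = (df1.join(df2.withColumnRenamed("length", "length2"),
                on="timestamp", how="full_outer")
          .sort("timestamp"))
df3.show()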

Spark SQL doesn't respect the DataFrame format

I'm analyzing Twitter files in JSON format with Spark SQL, with the goal of extracting the trending topics.
After taking all the text from a tweet and splitting the words, my DataFrame looks like this:
+--------------------+--------------------+
| line| words|
+--------------------+--------------------+
|[RT, #ONLYRPE:, #...| RT|
|[RT, #ONLYRPE:, #...| #ONLYRPE:|
|[RT, #ONLYRPE:, #...| #tlrp|
|[RT, #ONLYRPE:, #...| followan?|
I just need the words column, so I convert my table to a temp view.
df.createOrReplaceTempView("Twitter_test_2")
With the help of Spark SQL it should be very easy to get the trending topics; I just need a SQL query using the LIKE operator in the where condition: words like '#%'.
spark.sql("select words,
count(words) as count
from words_Twitter
where words like '#%'
group by words
order by count desc limit 10").show(20,False)
but I'm getting some strange results that I can't explain.
+---------------------+---+
|words |cnt|
+---------------------+---+
|#izmirescort |211|
|#PRODUCE101 |101|
|#VeranoMTV2017 |91 |
|#سلمان_يدق_خشم_العايل|89 |
|#ALDUBHomeAgain |67 |
|#BTS |32 |
|#سود_الله_وجهك_ياتميم|32 |
|#NowPlaying |32 |
For some reason the rows with counts 89 and 32, the two that contain Arabic characters, are not where they should be; the text appears to have been swapped with the counter.
Other times I am confronted with this kind of format.
spark.sql("select words, lang,count(words) count from Twitter_test_2 group by words,lang order by count desc limit 10 ").show()
After that query, my DataFrame looks very strange:
+--------------------+----+-----+
| words|lang|count|
+--------------------+----+-----+
| #VeranoMTV2017| pl| 6|
| #umRei| pt| 2|
| #Virgem| pt| 2|
| #rt
2| pl| 2|
| #rt
gazowaną| pl| 1|
| #Ziobro| pl| 1|
| #SomosPorto| pt| 1|
+--------------------+----+-----+
Why is that happening, and how can I avoid it?

MySQL sum over a window that contains a null value returns null

I am trying to get the sum of Revenue over the last 3 Month rows (excluding the current row) for each Client. Minimal example with current attempt in Databricks:
import numpy as np
import pandas as pd

cols = ['Client', 'Month', 'Revenue']
df_pd = pd.DataFrame([['A', 201701, 100],
                      ['A', 201702, 101],
                      ['A', 201703, 102],
                      ['A', 201704, 103],
                      ['A', 201705, 104],
                      ['B', 201701, 201],
                      ['B', 201702, np.nan],
                      ['B', 201703, 203],
                      ['B', 201704, 204],
                      ['B', 201705, 205],
                      ['B', 201706, 206],
                      ['B', 201707, 207]])
df_pd.columns = cols
spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql
""")
df_out.show()
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| NaN| 201.0|
| B|201703| 203.0| NaN|
| B|201704| 204.0| NaN|
| B|201705| 205.0| NaN|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
As you can see, if a null value exists anywhere in the 3 month window, a null value is returned. I would like to treat nulls as 0, hence the ifnull attempt, but this does not seem to work. I have also tried a case statement to change NULL to 0, with no luck.
Just coalesce outside sum:
df_out = sqlContext.sql("""
select *, coalesce(sum(Revenue) over (partition by Client
                                      order by Client, Month
                                      rows between 3 preceding and 1 preceding), 0) as Total_Sum3
from df_sql
""")
It is Apache Spark, my bad! (I'm working in Databricks and I thought it was MySQL under the hood.) Is it too late to change the title?
@Barmar, you are right that IFNULL() doesn't treat NaN as null. I managed to figure out the fix thanks to @user6910411 from here: SO link. I had to change the numpy NaNs to Spark nulls. The correct code, from after the sample df_pd is created:
spark_df = spark.createDataFrame(df_pd)
from pyspark.sql.functions import isnan, col, when
#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
for c, t in spark_df.dtypes])
spark_df.createOrReplaceTempView('df_sql')
df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
order by Client,Month
rows between 3 preceding and 1 preceding)) as Total_Sum3
from df_sql order by Client,Month
""")
df_out.show()
which then gives the desired output:
+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
| A|201701| 100.0| null|
| A|201702| 101.0| 100.0|
| A|201703| 102.0| 201.0|
| A|201704| 103.0| 303.0|
| A|201705| 104.0| 306.0|
| B|201701| 201.0| null|
| B|201702| null| 201.0|
| B|201703| 203.0| 201.0|
| B|201704| 204.0| 404.0|
| B|201705| 205.0| 407.0|
| B|201706| 206.0| 612.0|
| B|201707| 207.0| 615.0|
+------+------+-------+----------+
Is sqlContext the best way to approach this or would it be better / more elegant to achieve the same result via pyspark.sql.window?
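Regarding that last question: the same result can be expressed with the DataFrame API via pyspark.sql.window. A minimal sketch (assuming the spark_df from above, with the NaNs already converted to nulls):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# frame over the previous 3 rows per client, excluding the current row
w = Window.partitionBy('Client').orderBy('Month').rowsBetween(-3, -1)
df_out = spark_df.withColumn(
    'Total_Sum3',
    F.sum(F.coalesce(F.col('Revenue'), F.lit(0))).over(w)
)
df_out.show()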

How to replace a string in a column with other string from the same column

I have the dataframe below:
id,code
1,GSTR
2,GSTR
3,NA
4,NA
5,NA
Here GSTR may change; it can be anything. I want to replace NA with the other string that is present in the same column.
In this case I want to replace NA with the other string present in the column, i.e. GSTR. I tried to use UDFs, but since the other string is unknown I am not able to figure it out.
Note: in this code column there will be only two strings; one will be "NA" and the other can be anything (in our case GSTR is the other string).
Expected output
1,GSTR
2,GSTR
3,GSTR
4,GSTR
5,GSTR
We can take the distinct string other than 'NA' and use it:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,'GSTR'),(2,'GSTR'),(3,'NA'),(4,'NA'),(5,'NA')],['id','code'])
>>> df.show()
+---+----+
| id|code|
+---+----+
| 1|GSTR|
| 2|GSTR|
| 3| NA|
| 4| NA|
| 5| NA|
+---+----+
>>> rstr = df.where(df.code != 'NA')[['code']].first().code
>>> df.withColumn('code',F.lit(rstr)).show()
+---+----+
| id|code|
+---+----+
| 1|GSTR|
| 2|GSTR|
| 3|GSTR|
| 4|GSTR|
| 5|GSTR|
+---+----+
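If you'd rather rewrite only the 'NA' rows and leave the other values untouched, a conditional variant of the same idea (reusing rstr and F from above) gives the same result here:
>>> df.withColumn('code', F.when(df.code == 'NA', F.lit(rstr)).otherwise(df.code)).show()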
Hope this helps.
