Iterate cols PySpark - python-3.x

I have a SQL table containing 40 columns (ID, Product, Product_ID, Date, etc.) and would like to iterate over all columns to get the distinct values in each.
Customer table (sample):
ID  Product
1   gadget
2   VR
2   AR
3   hi-fi
I have tried using dropDuplicates within a function that loops over all columns, but the resulting output only returns one distinct value per column instead of all possible distinct values.
Expected Result:
Column   Value
ID       1
ID       2
ID       3
Product  gadget
Product  VR
Product  AR
Product  hi-fi
Actual Result:
Column   Value
ID       1
Product  gadget

The idea is to use collect_set() to fetch the distinct elements of each column and then explode the resulting arrays.
from pyspark.sql.functions import lit, collect_set, explode, array, struct, col, substring, length, expr
# All columns that need to be aggregated should be added to col_list.
col_list = ['ID', 'Product']
exprs = [collect_set(x) for x in col_list]
Let's start aggregating.
df = spark.createDataFrame([(1,'gadget'),(2,'VR'),(2,'AR'),(3,'hi-fi')], schema=['ID','Product'])
df = df.withColumn('Dummy', lit('Dummy'))
# While exploding later, the datatypes must be the same, so cast ID to string.
df = df.withColumn('ID', col('ID').cast('string'))
# Create one array of distinct values per column.
df = df.groupby("Dummy").agg(*exprs)
df.show(truncate=False)
+-----+---------------+-----------------------+
|Dummy|collect_set(ID)|collect_set(Product)   |
+-----+---------------+-----------------------+
|Dummy|[3, 1, 2]      |[AR, VR, hi-fi, gadget]|
+-----+---------------+-----------------------+
def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ['Dummy']).drop('Dummy')
df.show()
+--------------------+--------------------+
|                 key|                 val|
+--------------------+--------------------+
|     collect_set(ID)|           [3, 1, 2]|
|collect_set(Product)|[AR, VR, hi-fi, g...|
+--------------------+--------------------+
df = df.withColumn('val', explode(col('val')))
df = df.withColumnRenamed('key', 'Column').withColumnRenamed('val', 'Value')
df = df.withColumn('Column', expr("substring(Column,13,length(Column)-13)"))
df.show()
+-------+------+
| Column| Value|
+-------+------+
|     ID|     3|
|     ID|     1|
|     ID|     2|
|Product|    AR|
|Product|    VR|
|Product| hi-fi|
|Product|gadget|
+-------+------+
Note: All columns that are not strings should be cast to string, e.g. df = df.withColumn('ID', col('ID').cast('string')). Otherwise, you will get an error.
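For the full 40-column table, a small loop can apply that cast to every column before the aggregation. A minimal sketch, assuming col_list holds all the column names you want to transpose:
from pyspark.sql.functions import col

# Cast every listed column to string so the exploded arrays share one element type.
for c in col_list:
    df = df.withColumn(c, col(c).cast('string'))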

Related

Pyspark - Find sub-string from a column of data-frame with another data-frame

I have two different dataframes of String type in Pyspark. The first dataframe contains single words, while the second contains strings of words, i.e. sentences. I have to check for the existence of each word from the first dataframe's column in the second dataframe's column. For example,
df2
+---+------+------+-----------------+
|age|height|  name|        Sentences|
+---+------+------+-----------------+
| 10|    80| Alice|   'Grace, Sarah'|
| 15|  null|   Bob|          'Sarah'|
| 12|  null|   Tom|'Amy, Sarah, Bob'|
| 13|  null|Rachel|       'Tom, Bob'|
+---+------+------+-----------------+
Second dataframe
df1
+-------+
|  token|
+-------+
|  'Ali'|
|'Sarah'|
|  'Bob'|
|  'Bob'|
+-------+
So, how can I search for each token of df1 in the df2 Sentences column? I need a count for each word, added as a new column in df1.
I have tried this solution, but it only works for a single word, i.e. not for a complete dataframe column.
Considering the dataframes in the previous answer:
from pyspark.sql.functions import explode,explode_outer,split, length,trim
df3 = df2.select('Sentences',explode(split('Sentences',',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)
df3 = df3.join(df1, df1.token == df3.friends,how='inner').groupby('friends').count()
display(df3)
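To attach the counts back onto df1 as a new column, as the question asks, a minimal follow-up sketch (assuming df3 now holds the friends/count pairs produced by the groupby above) could be:
# Left-join the counts back onto df1 and treat tokens with no match as 0.
df1_with_count = (
    df1.join(df3, df1.token == df3.friends, how='left')
       .drop('friends')
       .fillna(0, subset=['count'])
)
df1_with_count.show()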
You could use a pyspark udf to create the new column in df1.
The problem is that you cannot access a second dataframe inside a udf (view here).
As advised in the referenced question, you could make the sentences a broadcast variable.
Here is a working example:
from pyspark.sql.types import *
from pyspark.sql.functions import udf
# Instantiate df2
cols = ["age", "height", "name", "Sentences"]
data = [
    (10, 80, "Alice", "Grace, Sarah"),
    (15, None, "Bob", "Sarah"),
    (12, None, "Tom", "Amy, Sarah, Bob"),
    (13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)
# Instantiate df1
cols = ["token"]
data = [
    ("Ali",),
    ("Sarah",),
    ("Bob",),
    ("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)
# Creating a broadcast variable from the Sentences column of df2
lstSentences = [row[0] for row in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)
def countWordInSentence(word):
    # Count the sentences (read from the broadcast value) that contain the word
    return sum(1 for item in sentences.value if word in item)

func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT", func_udf(df1["token"]))
df1.show()

How to convert single String column to multiple columns based on delimiter in Apache Spark

I have a data frame with a string column and I want to create multiple columns out of it.
Here is my input data, and pagename is my string column.
I want to create multiple columns from it. The format of the string is always the same: col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output, I need multiple columns, col1 to colN, with the values as rows for each column. Here is the output:
How can I do this in Spark? Scala or Python are both fine for me. The code below creates the input dataframe:
scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]
scala> df.show(false)
+---+-----------------+
|id |pagename         |
+---+-----------------+
|2  |a:101 b:501 c:201|
|1  |a:100 b:500 c:200|
+---+-----------------+
scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- pagename: string (nullable = false)
Note - The example shows only 3 columns here but in general I have more than 100 columns that I expect to deal with.
You can use str_to_map, explode the resulting map and pivot:
val df2 = df.select(
  col("id"),
  expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))
df2.show
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|100|500|200|
|  2|101|501|201|
+---+---+---+---+
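Since the question accepts Python as well, here is a minimal PySpark sketch of the same str_to_map + explode + pivot approach, assuming the same df with id and pagename columns:
from pyspark.sql import functions as F

# Exploding a map column yields the columns "key" and "value".
df2 = (
    df.select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
      .groupBy("id")
      .pivot("key")
      .agg(F.first("value"))
)
df2.show()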
So two options immediately come to mind
Delimiters
You've got some obvious delimiters that you can split on. For this use the split function
from pyspark.sql import functions as F
delimiter = ":"
df = df.withColumn(
    "split_column",
    F.split(F.col("pagename"), delimiter)
)
# "split_column" is now an array, so we need to pull items out of the array
df = df.withColumn(
    "a",
    F.col("split_column").getItem(0)
)
Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then do the int conversion, but this is easily applied to multiple columns.
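A minimal sketch of that cleanup for column a, assuming the ':'-split array above (its element 1 is '100 b', so split it again on the space and cast):
df = df.withColumn(
    "a",
    F.split(F.col("split_column").getItem(1), " ").getItem(0).cast("int")
)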
Regex
As the format is pretty fixed, you can do the same thing with a regex.
import re

# The capture groups are filled with \S+ here; adjust them to the actual value format.
regex_pattern = r"a\:(\S+) b\:(\S+) c\:(\S+)"
match_groups = ["a", "b", "c"]
for i in range(re.compile(regex_pattern).groups):
    df = df.withColumn(
        match_groups[i],
        F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
    )
CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)

How to create an unique autogenerated Id column in a spark dataframe

I have a dataframe where I have to generate a unique Id in one of the columns. This id has to be generated with an offset.
Because I need to persist this dataframe with the autogenerated id, if new data comes in, the autogenerated id should not collide with the existing ones.
I checked the monotonically increasing function but it does not accept any offset.
This is what I tried:
df=df.coalesce(1);
df = df.withColumn(inputCol,functions.monotonically_increasing_id());
But is there a way to make the monotonically_increasing_id() start from a starting offset ?
You can simply add an offset to it to provide a minimum value for the id. Note that it is not guaranteed that the values will start from the minimum value.
.withColumn("id", monotonically_increasing_id() + 123)
Explanation: Operator + is overloaded for columns https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L642
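The same idea in PySpark, as a minimal sketch (the column name "id" and the offset 123 simply mirror the snippet above):
from pyspark.sql import functions as F

# monotonically_increasing_id() returns a Column, and + is overloaded on Columns,
# so a constant offset can be added directly.
df = df.withColumn("id", F.monotonically_increasing_id() + 123)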
Or, if you don't want to restrict your program to only one partition with df.coalesce(1), you can use zipWithIndex, which starts with index = 0, as follows:
lines = [["a1", "a2", "a3"],
["b1", "b2", "b3"],
["c1", "c2", "c3"]]
cols = ["c1", "c2", "c3"]
df = spark.createDataFrame(lines, cols)
start_indx = 10
df = df.rdd.zipWithIndex() \
.map(lambda (r, indx): (indx + start_indx, r[0], r[1], r[2])) \
.toDF(["id", "c1", "c2", "c3"])
df.show(10, False)
In this case I set start_indx = 10, and this will be the output:
+---+---+---+---+
|id |c1 |c2 |c3 |
+---+---+---+---+
|10 |a1 |a2 |a3 |
|11 |b1 |b2 |b3 |
|12 |c1 |c2 |c3 |
+---+---+---+---+
You could add a row number to your columns and then add that to the maximum existing identity value, or to your offset. Once it is set, drop the row_number attribute.
from pyspark.sql import functions as sf
from pyspark.sql.window import Window
# Could also grab the existing max ID value
seed_value = 123
df = df.withColumn(
    "row_number",
    sf.row_number().over(Window.partitionBy(sf.col("natural_key")).orderBy(sf.col("anything")))
)
df = df.withColumn("id", sf.col("row_number") + seed_value)
Remember to drop the row_number attribute.
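A minimal sketch of the "grab the existing max ID" variant hinted at in the comment above, assuming the already-persisted data lives in a hypothetical existing_df with an id column:
from pyspark.sql import functions as sf

# Use the current maximum id as the offset (or 0 if nothing has been persisted yet).
max_existing_id = existing_df.agg(sf.max("id")).collect()[0][0] or 0
df = df.withColumn("id", sf.col("row_number") + sf.lit(max_existing_id))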

Pyspark Replicate Row based on column value

I would like to replicate all rows in my DataFrame based on the value of a given column in each row, and then index each new row. Suppose I have:
Column A  Column B
T1        3
T2        2
I want the result to be:
Column A  Column B  Index
T1        3         1
T1        3         2
T1        3         3
T2        2         1
T2        2         2
I was able to do something similar with fixed values, but not by using the information found in the column. My current working code for fixed values is:
idx = [lit(i) for i in range(1, 10)]
df = df.withColumn('Index', explode(array( idx ) ))
I tried to change:
lit(i) for i in range(1, 10)
to
lit(i) for i in range(1, df['Column B'])
and add it into my array() function:
df = df.withColumn('Index', explode(array( lit(i) for i in range(1, df['Column B']) ) ))
but it does not work (TypeError: 'Column' object cannot be interpreted as an integer).
How should I implement this?
Unfortunately you can't iterate over a Column like that. You can always use a udf, but I do have a non-udf hack solution that should work for you if you're using Spark version 2.1 or higher.
The trick is to take advantage of pyspark.sql.functions.posexplode() to get the index value. We do this by creating a string by repeating a comma Column B times. Then we split this string on the comma, and use posexplode to get the index.
df.createOrReplaceTempView("df") # first register the DataFrame as a temp table
query = 'SELECT '\
    '`Column A`,'\
    '`Column B`,'\
    'pos AS Index '\
    'FROM ( '\
    'SELECT DISTINCT '\
    '`Column A`,'\
    '`Column B`,'\
    'posexplode(split(repeat(",", `Column B`), ",")) '\
    'FROM df) AS a '\
    'WHERE a.pos > 0'
newDF = sqlCtx.sql(query).sort("Column A", "Column B", "Index")
newDF.show()
#+--------+--------+-----+
#|Column A|Column B|Index|
#+--------+--------+-----+
#|      T1|       3|    1|
#|      T1|       3|    2|
#|      T1|       3|    3|
#|      T2|       2|    1|
#|      T2|       2|    2|
#+--------+--------+-----+
Note: You need to wrap the column names in backticks since they have spaces in them as explained in this post: How to express a column which name contains spaces in Spark SQL
You can try this:
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql import functions as F
df = spark.read.csv('/FileStore/tables/stack1.csv', header = 'True', inferSchema = 'True')
w = Window.orderBy("Column A")
df = df.select(row_number().over(w).alias("Index"), col("*"))
n_to_array = udf(lambda n : [n] * n ,ArrayType(IntegerType()))
df2 = df.withColumn('Column B', n_to_array('Column B'))
df3= df2.withColumn('Column B', explode('Column B'))
df3.show()
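For Spark 2.4 or later, a minimal DataFrame-API sketch of the same replication, assuming df has the Column A / Column B schema shown above: sequence() builds the array [1, ..., Column B] and explode() turns it into one row per element.
from pyspark.sql import functions as F

result = df.withColumn("Index", F.explode(F.sequence(F.lit(1), F.col("Column B"))))
result.show()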

Merge two data frame with few different columns

I want to merge several DataFrames that have a few different columns.
Suppose,
DataFrame A has 3 columns: Column_1, Column_2, Column_3
DataFrame B has 3 columns: Column_1, Column_2, Column_4
DataFrame C has 3 columns: Column_1, Column_2, Column_5
I want to merge these DataFrames such that I get a DataFrame like :
Column_1, Column_2, Column_3, Column_4, Column_5
The number of DataFrames may increase. Is there any way to do this merge, such that for a particular Column_1/Column_2 combination I get the values of the other three columns in the same row, and if a particular Column_1/Column_2 combination has no data in some columns, it shows null there?
DataFrame A:
Column_1  Column_2  Column_3
1         x         abc
2         y         def
DataFrame B:
Column_1  Column_2  Column_4
1         x         xyz
2         y         www
3         z         sdf
The merge of A and B:
Column_1  Column_2  Column_3  Column_4
1         x         abc       xyz
2         y         def       www
3         z         null      sdf
If I understand your question correctly, you'll need to perform an outer join using a sequence of columns as keys.
I have used the data provided in your question to illustrate how it is done:
scala> val df1 = Seq((1,"x","abc"),(2,"y","def")).toDF("Column_1","Column_2","Column_3")
// df1: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string]
scala> val df2 = Seq((1,"x","xyz"),(2,"y","www"),(3,"z","sdf")).toDF("Column_1","Column_2","Column_4")
// df2: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_4: string]
scala> val df3 = df1.join(df2, Seq("Column_1","Column_2"), "outer")
// df3: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string, Column_4: string]
scala> df3.show
// +--------+--------+--------+--------+
// |Column_1|Column_2|Column_3|Column_4|
// +--------+--------+--------+--------+
// |       1|       x|     abc|     xyz|
// |       2|       y|     def|     www|
// |       3|       z|    null|     sdf|
// +--------+--------+--------+--------+
This is called an equi-join with another DataFrame using the given columns.
It differs from other join functions in that the join columns only appear once in the output, similar to SQL's JOIN USING syntax.
Note
Outer equi-joins are available since Spark 1.6.
First register all three data frames as temp views, so that SQL queries can be run against them:
DF1.createOrReplaceTempView("df1view")
DF2.createOrReplaceTempView("df2view")
DF3.createOrReplaceTempView("df3view")
Then use these join queries to merge:
val intermediateDF = spark.sql("SELECT a.column1, a.column2, a.column3, b.column4 FROM df1view a LEFT JOIN df2view b ON a.column1 = b.column1 AND a.column2 = b.column2")
intermediateDF.createOrReplaceTempView("imDFview")
val resultDF = spark.sql("SELECT a.column1, a.column2, a.column3, a.column4, b.column5 FROM imDFview a LEFT JOIN df3view b ON a.column1 = b.column1 AND a.column2 = b.column2")
These joins can also be combined into a single join. Also, since you want all values of column1 and column2, you can use a full outer join instead of a left join.
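For a growing number of DataFrames, a minimal PySpark sketch (the list dfs and the dataframes dfA, dfB, dfC are hypothetical) folds the outer join over all of them:
from functools import reduce

# Outer-join every dataframe in the list on the shared key columns.
dfs = [dfA, dfB, dfC]  # hypothetical dataframes sharing Column_1 and Column_2
merged = reduce(lambda left, right: left.join(right, ["Column_1", "Column_2"], "outer"), dfs)
merged.show()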
