I have a dataframe that looks like this:
df = spark.sql("""
SELECT list
FROM categories
""")
df.show()
+----------------+
|            list|
+----------------+
| 1,1,1,2,2,apple|
|apple,orange,1,2|
+----------------+
And I want a result something like this:
+------+---------------+
|  list|frequency_count|
+------+---------------+
|     1|              4|
|     2|              3|
| apple|              2|
|orange|              1|
+------+---------------+
This is what I tried.
count_df = df.withColumn('count', F.size(F.split('list', ',')))
count_df.show(truncate=False)
df.createOrReplaceTempView('tmp')
freq_sql = """
select list,count(*) count from
(select explode(flatten(collect_list(split(list, ',')))) list
from tmp)
group by list
"""
freq_df = spark.sql(freq_sql)
freq_df.show(truncate=False)
And I'm getting this error
AnalysisException: cannot resolve 'split(df.`list`, ',', -1)' due to
data type mismatch: argument 1 requires string type, however,
'df.`list`' is of array<string> type.;
You are currently trying to flatten a single list-type value; however, the flatten function expects an array of arrays:
flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
Hence the error you are facing.
You need to explode the list first so each comma-separated string becomes its own row, then split it on the comma, explode again to turn the elements into rows, and finally groupBy to get the required count.
Data Preparation
# `sql` here refers to the SparkSession
sparkDF = sql.createDataFrame(
    [
        (["1,1,1,2,2,apple"],),
        (["apple,orange,1,1"],),
    ],
    ("list",)
)
Explode
sparkDF.createOrReplaceTempView("dataset")
sql.sql("""
    SELECT
        explode(list) as exploded
        ,list
    FROM dataset
""").printSchema()
root
|-- exploded: string (nullable = true)
|-- list: array (nullable = true)
| |-- element: string (containsNull = true)
+----------------+------------------+
| exploded| list|
+----------------+------------------+
| 1,1,1,2,2,apple| [1,1,1,2,2,apple]|
|apple,orange,1,1|[apple,orange,1,1]|
+----------------+------------------+
Group By
sql.sql("""
    SELECT
        exploded
        ,count(*) as count
    FROM (
        SELECT
            EXPLODE(SPLIT(list, ",")) as exploded
        FROM (
            SELECT
                EXPLODE(list) as list
            FROM dataset
        )
    )
    GROUP BY 1
""").show()
+--------+-----+
|exploded|count|
+--------+-----+
| orange| 1|
| apple| 2|
| 1| 5|
| 2| 2|
+--------+-----+
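For reference, here is a minimal sketch of the same pipeline using the DataFrame API instead of SQL (assuming the same sparkDF as above):
from pyspark.sql import functions as F

freq_df = (
    sparkDF
    # array<string> -> one comma-separated string per row
    .select(F.explode("list").alias("csv"))
    # split each string on "," and explode into individual items
    .select(F.explode(F.split("csv", ",")).alias("list"))
    .groupBy("list")
    .count()
)
freq_df.show()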
Related
I have a PySpark df, distributed across the cluster, as follows:
Name ID
A 1
B 2
C 3
I want to modify the 'ID' column so that every value becomes a Python dictionary, with the column name as the key and the existing value as the value, as follows:
Name TRACEID
A {ID:1}
B {ID:2}
C {ID:3}
How do I achieve this using PySpark? I need an efficient solution, since it's a large DataFrame distributed across the cluster.
Thanks in advance.
You can first construct a struct from the ID column, and then use the to_json function to convert it to the desired format.
df = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
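For illustration, a minimal sketch of what this produces on the sample data (assuming the df from the question); note that TRACEID ends up as a JSON string rather than a map:
from pyspark.sql import functions as F

result = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
result.show()
# +----+--------+
# |Name| TRACEID|
# +----+--------+
# |   A|{"ID":1}|
# |   B|{"ID":2}|
# |   C|{"ID":3}|
# +----+--------+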
You can use the create_map function
from pyspark.sql.functions import col, lit, create_map
sparkDF.withColumn("ID_dict", create_map(lit("id"),col("ID"))).show()
# +----+---+---------+
# |Name| ID| ID_dict|
# +----+---+---------+
# | A| 1|{id -> 1}|
# | B| 2|{id -> 2}|
# | C| 3|{id -> 3}|
# +----+---+---------+
Rename/drop columns:
df = sparkDF.withColumn("ID_dict", create_map(lit("id"), col("ID"))).drop(col("ID")).withColumnRenamed("ID_dict", "ID")
df.show()
# +----+---------+
# |Name| ID|
# +----+---------+
# | A|{id -> 1}|
# | B|{id -> 2}|
# | C|{id -> 3}|
# +----+---------+
df.printSchema()
# root
# |-- Name: string (nullable = true)
# |-- ID: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
You get a column with map datatype that's well suited for representing a dictionary.
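If you later need to read a value back out of the map column, a small sketch (key name "id" as used above):
from pyspark.sql.functions import col

# getItem("id") pulls the value stored under the "id" key of the map
df.select("Name", col("ID").getItem("id").alias("ID_value")).show()
# +----+--------+
# |Name|ID_value|
# +----+--------+
# |   A|       1|
# |   B|       2|
# |   C|       3|
# +----+--------+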
I have a data frame with a string column and I want to create multiple columns out of it.
Here is my input data; pagename is the string column. I want to create multiple columns from it. The format of the string is always the same - col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output I need multiple columns - col1 to colN - with the values as rows for each column.
How can I do this in Spark? Scala or Python is fine for me. The code below creates the input dataframe:
scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]
scala> df.show(false)
+---+-----------------+
|id |pagename |
+---+-----------------+
|2 |a:101 b:501 c:201|
|1 |a:100 b:500 c:200|
+---+-----------------+
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- pagename: string (nullable = false)
Note - The example shows only 3 columns here but in general I have more than 100 columns that I expect to deal with.
You can use str_to_map, explode the resulting map and pivot:
val df2 = df.select(
col("id"),
expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))
df2.show
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1|100|500|200|
| 2|101|501|201|
+---+---+---+---+
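If you prefer PySpark, a minimal sketch of the same idea; str_to_map may not be exposed as a Python function in your version, so it goes through expr():
from pyspark.sql import functions as F

df2 = (
    df.select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
      .groupBy("id")
      .pivot("key")
      .agg(F.first("value"))
)
df2.show()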
So two options immediately come to mind.
Delimiters
You've got some obvious delimiters that you can split on. For this, use the split function:
from pyspark.sql import functions as F
delimiter = ":"
df = df.withColumn(
"split_column",
F.split(F.col("pagename"), delimiter)
)
# "split_column" is now an array, so we need to pull items out the array
df = df.withColumn(
"a",
F.col("split_column").getItem(0)
)
Not ideal, as you'll still need some string manipulation to strip the whitespace and then cast to int - but this is easily applied to multiple columns, as in the sketch below.
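For example, a rough sketch of pulling out all three fields in one pass, splitting on spaces first and then on the colon so no whitespace cleanup is needed (column names a, b, c are taken from the sample data):
from pyspark.sql import functions as F

# "a:100 b:500 c:200" -> ["a:100", "b:500", "c:200"]
parts = F.split(F.col("pagename"), " ")
for i, name in enumerate(["a", "b", "c"]):
    df = df.withColumn(
        name,
        # take the i-th "key:value" piece, keep the value, cast to int
        F.split(parts.getItem(i), ":").getItem(1).cast("int"),
    )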
Regex
As the format is pretty fixed, you can do the same thing with a regex.
import re

# capture the digits after each key; adjust the groups if the values are not purely numeric
regex_pattern = r"a:(\d+) b:(\d+) c:(\d+)"
match_groups = ["a", "b", "c"]
for i in range(re.compile(regex_pattern).groups):
    df = df.withColumn(
        match_groups[i],
        F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
    )
CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)
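One follow-up: regexp_extract returns strings, so if you want integer columns you still need a cast afterwards, e.g. (names as above):
from pyspark.sql import functions as F

for name in match_groups:
    df = df.withColumn(name, F.col(name).cast("int"))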
Dataframe schema:
root
|-- ID: decimal(15,0) (nullable = true)
|-- COL1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: string (containsNull = true)
Sample data
+--------------------+--------------------+--------------------+
| COL1 | COL2 | COL3 |
+--------------------+--------------------+--------------------+
|[A, B, C, A] |[101, 102, 103, 104]|[P, Q, R, S] |
+--------------------+--------------------+--------------------+
I want to apply nested conditions on array elements.
For example,
Find COL3 elements where COL1 elements are A and COL2 elements are even.
Expected Output : [S]
I looked at various functions, e.g. array_position, but it returns only the first occurrence.
Is there any straightforward way, or do I have to explode the arrays?
Assuming your condition applies to array elements with the same index, it is possible to filter arrays with lambda functions in SQL since Spark 2.4.0, but this is still not exposed via the other language APIs and you need to use expr(). You simply zip the three arrays and then filter the resulting array of structs:
scala> df.show()
+---+------------+--------------------+------------+
| ID| COL1| COL2| COL3|
+---+------------+--------------------+------------+
| 1|[A, B, C, A]|[101, 102, 103, 104]|[P, Q, R, S]|
+---+------------+--------------------+------------+
scala> df.select($"ID", expr(s"""
| filter(
| arrays_zip(COL1, COL2, COL3),
| e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
| ).COL3 AS result
| """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
Since this uses expr() to supply an SQL expression as a column, it also works with PySpark:
>>> from pyspark.sql.functions import expr
>>> df.select(df.ID, expr("""
... filter(
... arrays_zip(COL1, COL2, COL3),
... e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
... ).COL3 AS result
... """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
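If you are on Spark 3.1 or later, the same higher-order functions are also exposed directly in pyspark.sql.functions, so the expr() detour is optional. A minimal sketch, assuming the same df:
from pyspark.sql import functions as F

result = df.select(
    "ID",
    F.filter(
        F.arrays_zip("COL1", "COL2", "COL3"),
        # keep only zipped elements where COL1 == "A" and COL2 is even
        lambda e: (e["COL1"] == "A") & (e["COL2"].cast("int") % 2 == 0),
    )["COL3"].alias("result"),
)
result.show()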
I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions because sometimes the topics_A column contains null.
Is there a way around this? Filling it with a zero-length array would work ok (although it will blow out the data size quite a lot) - but I can't work out how to do a fillNa on an Array column in PySpark.
Personally I would drop rows with NULL values, because there is no useful information there, but you can replace nulls with empty arrays. First some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of specific type as:
fill = array().cast("array<string>")
and combine it with when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
so with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)
the result will be:
+---+--------+-------------------+
| id|topics_A| topics_vec_A|
+---+--------+-------------------+
| 1| [a, b]|(2,[0,1],[1.0,1.0])|
| 2| []| (2,[],[])|
+---+--------+-------------------+
I had a similar issue; based on the comment, I used the following syntax to remove the null values before tokenizing:
clean_text_ddf.where(col("title").isNull()).show()
cleaned_text=clean_text_ddf.na.drop(subset=["title"])
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)
+-----+
|title|
+-----+
+-----+
+-----+
|title|
+-----+
+-----+
root
|-- title: string (nullable = true)
+--------------------+
| title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows
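Applied to the column from the original question, the same idea is simply (a sketch; note this drops the null rows entirely rather than keeping them as empty arrays):
df_ = df.na.drop(subset=["topics_A"])
topic_vectorizer_A.fit(df_).transform(df_)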
I have joined two data frames with a left outer join. The resulting data frame has null values. How do I make them empty instead of null?
+---+--------+
| id|quantity|
+---+--------+
|  1|    null|
|  2|    null|
|  3|    0.04|
+---+--------+
And here is the schema
root
|-- id: integer (nullable = false)
|-- quantity: double (nullable = true)
expected output
+---+--------+
| id|quantity|
+---+--------+
|  1|        |
|  2|        |
|  3|    0.04|
+---+--------+
You cannot make them "empty", since they are double values and the empty string "" is a String. The best you can do is leave them as nulls or set them to 0 using the fill function:
val df2 = df.na.fill(0.0, Seq("quantity"))
Otherwise, if you really want to have empty quantities, you should consider changing the quantity column's type to String.
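If you do go that route, here is a rough PySpark-flavored sketch of the idea (cast to string, then coalesce nulls to empty strings); the Scala version uses the same functions:
from pyspark.sql import functions as F

df3 = df.withColumn(
    "quantity",
    F.coalesce(F.col("quantity").cast("string"), F.lit("")),
)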