I have the Spark DataFrame below.
I want to promote row 1 to the column headings, so that the new Spark DataFrame uses the values of that row as its column names.
I know this can be done in pandas easily as:
new_header = pandaDF.iloc[0]
pandaDF = pandaDF[1:]
pandaDF.columns = new_header
But I don't want to convert to a pandas DataFrame, because I have to persist the result to a database. That would mean converting the pandas DataFrame back to a Spark DataFrame, registering it as a table, and then writing it to the database.
Try .toDF to rename the columns, then filter out the row that holds the column values.
#sample dataframe
#| prop_0| prop_1| prop_2|
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
from pyspark.sql.functions import col
# take the new column names from the first row (assumes that row holds the headings)
cols = [str(value) for value in df.first()]
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
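The .toDF-and-filter idea can be wrapped in a small helper. This is only a sketch: the names header_names and promote_first_row are made up for illustration, and it assumes that df.first() really returns the heading row (Spark only guarantees row order when the source preserves it) and that no data row repeats the heading value in the first column.

```python
def header_names(first_row):
    """Turn the values of the heading row into column names (as strings)."""
    return [str(value) for value in first_row]


def promote_first_row(df):
    """Rename the columns of a Spark DataFrame after its first row and drop that row.

    Assumption: df.first() is the heading row, and no data row has the
    heading value in the first column.
    """
    first_row = df.first()
    new_cols = header_names(first_row)
    renamed = df.toDF(*new_cols)
    key = new_cols[0]
    # Keep every row whose first column differs from the heading value.
    # Rows with a null in that column are dropped too, because the
    # comparison evaluates to null for them.
    return renamed.filter(renamed[key] != first_row[0])
```

Everything stays in Spark, so the result can be written to the database directly, for example with promote_first_row(df).write.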
I am using spark dataframes.
The task is to count the cities in each country and region (subcountry) and display the counts in descending order.
Initial data:
from pyspark.sql.functions import col
from pyspark.sql.functions import count
df = spark.read.json("/content/world-cities.json")
[screenshot of the initial data]
Desired result:
[screenshot of the desired result]
I get grouping only by the country column.
How do I add grouping by a second column, subcountry?
If I understand you correctly, you just need to add the second column to your groupBy:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),("USA","usa-subcountry", "usa-city-2"),("USA","usa-subcountry-2", "usa-city"), ("Argentina","argentina-subcountry", "argentina-city")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("cnt"))\
    .show()
Output is:
| country| subcountry|cnt|
| USA| usa-subcountry| 2|
| USA| usa-subcountry-2| 1|
|Argentina|argentina-subcountry| 1|
Edit: another try based on comment:
import pyspark.sql.functions as F
x = [("USA","usa-subcountry", "usa-city"),
("USA","usa-subcountry", "usa-city-2"),
("USA","usa-subcountry", "usa-city-3"),
("USA","usa-subcountry-2", "usa-city"),
("Argentina","argentina-subcountry", "argentina-city"),
("Argentina","argentina-subcountry-2", "argentina-city-2"),
("UK","UK-subcountry", "UK-city-1")]
df = spark.createDataFrame(x, schema=['country', 'subcountry', 'city'])
df.groupBy(F.col('country'), F.col('subcountry')).agg(F.count("*").alias("city_count"))\
    .groupBy(F.col('country')).agg(F.count("*").alias("subcountry_count"), F.sum('city_count').alias("city_count"))\
    .show()
| country|subcountry_count|city_count|
| USA| 2| 4|
|Argentina| 2| 2|
| UK| 1| 1|
I am assuming that cities and subcountries are unique; if not, consider using countDistinct instead of count.
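If city names can repeat within a region, and the counts should come out in descending order as the question asks, one option is to drop duplicate rows before counting. A minimal sketch, assuming the column names country, subcountry and city from the sample data; cities_per_region is a hypothetical helper name. Note that this counts a null city as one value, whereas countDistinct ignores nulls.

```python
def cities_per_region(df):
    """Count distinct cities per (country, subcountry), largest groups first.

    Assumes df has the columns country, subcountry and city.
    """
    return (
        df.dropDuplicates(["country", "subcountry", "city"])  # one row per distinct city
          .groupBy("country", "subcountry")
          .count()                                            # adds a column named "count"
          .orderBy("count", ascending=False)                  # descending order
    )
```

Calling cities_per_region(df).show() prints the grouped counts.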
I have an initial PySpark dataframe from which I would like to take the MIN and MAX from a date column and then create a new PySpark dataframe with a timeseries (daily date), using the MIN and MAX from my initial dataframe.
I will then join it with my initial dataframe to find missing days (null in the remaining columns of my initial DF).
I tried in many different ways to build the timeseries DF, but it doesn't seem to work in PySpark. Any suggestions?
The max value of a column can be extracted like this (F.min works the same way for the minimum):
max_date = df.agg(F.max('date_col')).head()[0]
Date range df can be created like this:
df2 = spark.sql("SELECT sequence(to_date('2000-01-01'), to_date('2000-02-02'), interval 1 day) as date_col").withColumn('date_col', F.explode('date_col'))
And then join.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2022-04-01'),(2, '2022-04-05')], ['id', 'df1_date']).select('id', F.col('df1_date').cast('date'))
# +---+----------+
# | id| df1_date|
# +---+----------+
# | 1|2022-04-01|
# | 2|2022-04-05|
# +---+----------+
min_date = df1.agg(F.min('df1_date')).head()[0]
max_date = df1.agg(F.max('df1_date')).head()[0]
df2 = spark.sql(f"SELECT sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day) as df2_date").withColumn('df2_date', F.explode('df2_date'))
df3 = df2.join(df1, df1.df1_date == df2.df2_date, 'left')
# +----------+----+----------+
# | df2_date| id| df1_date|
# +----------+----+----------+
# |2022-04-01| 1|2022-04-01|
# |2022-04-02|null| null|
# |2022-04-03|null| null|
# |2022-04-04|null| null|
# |2022-04-05| 2|2022-04-05|
# +----------+----+----------+
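Since the goal is to find the missing days, the null rows do not have to be filtered out of the joined result by hand. A left anti join returns only the calendar dates that have no match. This is a sketch; missing_days is a hypothetical helper, and the default column names are the ones used in the full example.

```python
def missing_days(calendar_df, data_df, calendar_col="df2_date", data_col="df1_date"):
    """Return the rows of calendar_df whose date has no matching row in data_df."""
    condition = calendar_df[calendar_col] == data_df[data_col]
    # A left anti join keeps only the left-hand rows without a match on the
    # right, and returns only the left-hand columns.
    return calendar_df.join(data_df, condition, "left_anti")
```

With df2 and df1 from the full example, missing_days(df2, df1) should contain the three dates 2022-04-02, 2022-04-03 and 2022-04-04.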
I have a CSV with headings that I'd like to save as Parquet (actually a delta table)
The column headings have spaces in them, which parquet can't handle. How do I change spaces to underscores?
This is what I have so far, cobbled together from other SO posts:
from pyspark.sql.functions import *
df = spark.read.option("header", True).option("delimiter","\u0001").option("inferSchema",True).csv("/mnt/landing/MyFile.TXT")
names = df.schema.names
for name in names:
    df2 = df.withColumnRenamed(name,regexp_replace(name, ' ', '_'))
When I run this, the final line gives me this error:
TypeError: Column is not iterable
I thought this would be a common requirement given that parquet can't handle spaces but it's quite difficult to find any examples.
You need to use the reduce function to apply the renaming to the dataframe iteratively, because in your code df2 ends up with only the last column renamed: every loop iteration starts again from the original df.
The code would look as follows (instead of the for loop):
from functools import reduce
df2 = reduce(lambda data, name: data.withColumnRenamed(name, name.replace(' ', '_')),
             names, df)
You are getting the exception because regexp_replace returns a Column, but withColumnRenamed expects a String for the new name.
def regexp_replace(e: org.apache.spark.sql.Column,pattern: String,replacement: String): org.apache.spark.sql.Column
def withColumnRenamed(existingName: String,newName: String): org.apache.spark.sql.DataFrame
Use .toDF or .select and pass the list of new column names to create a new dataframe.
#| id|id a|id b|
#| 1| a| b|
#| 2| c| d|
from pyspark.sql.functions import col
new_cols=list(map(lambda x: x.replace(" ", "_"), df.columns))
df.toDF(*new_cols).show()
df.select([col(s).alias(s.replace(' ','_')) for s in df.columns]).show()
#| id|id_a|id_b|
#| 1| a| b|
#| 2| c| d|
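For reuse across files, the renaming can be split into a plain-Python function for the names and a thin wrapper around toDF. This is a sketch rather than a definitive implementation: clean_column_name and with_clean_column_names are hypothetical names, and collapsing runs of whitespace (instead of replacing single spaces) is a choice made here, not something the answers above do.

```python
import re


def clean_column_name(name):
    """Collapse each run of whitespace in a column name into one underscore."""
    return re.sub(r"\s+", "_", name.strip())


def with_clean_column_names(df):
    """Rename every column of a Spark DataFrame in a single toDF call."""
    new_names = [clean_column_name(c) for c in df.columns]
    if len(set(new_names)) != len(new_names):
        # e.g. "id a" and "id  a" would both become "id_a"
        raise ValueError(f"column names collide after cleaning: {new_names}")
    return df.toDF(*new_names)
```

Because the name logic never touches Spark, it can be checked on plain strings: clean_column_name("id a") gives "id_a".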
I have a Spark dataframe with two columns: src_edge and dest_edge. I simply want to create a new Spark dataframe that contains a single column id with the values from both src_edge and dest_edge.
src dst
1 2
1 3
I want to create df2 as:
If possible, I would also like to create df2 with no duplicate values. Does anyone have any idea how to do this?
The simplest thing may be to select each column, union them, and call distinct:
from pyspark.sql.functions import col
df2 = df.select(col("src").alias("id")).union(df.select(col("dst").alias("id"))).distinct()
#| id|
#| 1|
#| 3|
#| 2|
You can also accomplish this with an outer join:
df2 = df.select(col("src").alias("id"))\
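The snippet above is cut off. One way the outer-join variant could look is sketched below; this is an assumption about the approach, not a reconstruction of the original code, and ids_via_outer_join is a hypothetical name.

```python
def ids_via_outer_join(df):
    """Combine the src and dst columns into one id column with a full outer join.

    Assumes df has columns named src and dst, as in the example above.
    """
    src_ids = df.select(df["src"].alias("id"))
    dst_ids = df.select(df["dst"].alias("id"))
    # Joining on the column name yields a single id column that holds the
    # values from both sides; distinct() then removes repeated ids.
    return src_ids.join(dst_ids, on="id", how="outer").distinct()
```

The union version is usually the simpler choice; the join is shown only because the answer mentions it.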
Create a new column using array and explode to combine and flatten the two columns. Then, to remove duplicates use dropDuplicates:
from pyspark.sql.functions import array, explode
df2 = df.select(explode(array("src", "dst")).alias("id")).dropDuplicates()
I have data like the example data below. I’m trying to create a new column in my data using PySpark that would be the category of the first event for a customer based on the timestamp. Like the example output data below.
I have an example below of what I think would accomplish it using a window function in sql.
I’m pretty new to PySpark. I understand you can run sql inside of PySpark. I’m wondering if I have the code correct below to run the sql window function in PySpark. That is I’m wondering if I can just paste the sql code inside of spark.sql, as I have below.
eventid customerid category timestamp
1 3 a 1/1/12
2 3 b 2/3/14
4 2 c 4/1/12
eventid customerid category timestamp first_event
1 3 a 1/1/12 a
2 3 b 2/3/14 a
4 2 c 4/1/12 c
window function example:
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table
# implementing window function example with pyspark
# Note: assume df is dataframe with structure of table above
# (df is table)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Operations").getOrCreate()
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("table")
# Triple quotes let the SQL text span several lines
sql_results = spark.sql("""select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over(partition by customerid order by timestamp) first_event
from table""")
# display results
sql_results.show()
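You can use a window function in PySpark as well:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>> df.show()
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
>>> # order each customer's events by time so that first() picks the earliest one
>>> # (timestamp must be a date/timestamp column, not a string, for the order to be chronological)
>>> window = Window.partitionBy('customerid').orderBy('timestamp')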
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>> df.show()
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
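To the original question of whether the SQL can simply be passed to spark.sql: yes, once the DataFrame is registered as a temporary view. The sketch below also parses the dates before ordering, because values such as 1/1/12 sort alphabetically when they are stored as strings. Several things here are assumptions: the timestamp column is taken to be a string in month/day/two-digit-year form (pattern M/d/yy), the view name events is made up, and first_event_query and add_first_event are hypothetical helper names.

```python
def first_event_query(view_name, date_format="M/d/yy"):
    """Build the SQL text for the window query.

    Both arguments are interpolated as-is, so pass trusted values only.
    Assumption: timestamp is a string column in the given date format.
    """
    return f"""
        SELECT eventid, customerid, category, `timestamp`,
               FIRST_VALUE(category) OVER (
                   PARTITION BY customerid
                   ORDER BY to_date(`timestamp`, '{date_format}')
               ) AS first_event
        FROM {view_name}
    """


def add_first_event(spark, df, view_name="events"):
    """Register df as a temporary view and run the window query on it."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(first_event_query(view_name))
```

With an existing session, add_first_event(spark, df).show() displays the result. If timestamp is already a date or timestamp column, the to_date call can be replaced by the plain column.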