Modify spark DataFrame column - apache-spark

I would like to change the following dataframe:
--id--rating--timestamp--
-------------------------
| 0 | 5.0 | 231312231 |
| 1 | 3.0 | 192312311 | # Epoch time (seconds since Thursday, 1 January 1970)
-------------------------
to the following dataframe:
--id--rating--timestamp--
--------------------------
| 0 | 5.0 | 05 |
| 1 | 3.0 | 04 | #Month of year
--------------------------
How can I do that?

It's easy using built-in functions:
import org.apache.spark.sql.functions._
import spark.implicits._
val newDF = dataset.withColumn("timestamp", month(from_unixtime('timestamp)))
Note that DataFrames are immutable, so you can create a new DataFrame but not modify an existing one. Of course, you can assign the new Dataset to the same variable.
Note number 2: DataFrame = Dataset[Row], which is why I use both names.

If you are coming from Scala, you can use the sql.functions methods inside the DataFrame.select or DataFrame.withColumn methods. For your case, I think the method month(e: Column): Column can perform the change you want. Because the column holds epoch seconds, convert it with from_unixtime first. It will be something like this:
import org.apache.spark.sql.functions.{col, from_unixtime, month}
df.withColumn("timestamp", month(from_unixtime(col("timestamp"))))
I believe there is an equivalent way in Java, Python and R.
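For PySpark, a rough counterpart might look like the sketch below. The helper names with_month and month_utc are made up for illustration, and df is assumed to hold epoch seconds in its timestamp column. Note that from_unixtime renders the time in the Spark session time zone, so a value close to a month boundary can land in a different month than the plain-Python UTC cross-check reports.

```python
from datetime import datetime, timezone


def with_month(df):
    """Replace the epoch-seconds column `timestamp` with its month number."""
    # Imported here so the cross-check below also runs without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn("timestamp", F.month(F.from_unixtime("timestamp")))


def month_utc(epoch_seconds):
    """Plain-Python cross-check: month of an epoch value, interpreted in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).month


# Months (in UTC) of the two sample values from the question.
months = [month_utc(ts) for ts in (231312231, 192312311)]
print(months)
```

The first sample value falls in the early hours of 1 May 1977 in UTC, so a session time zone west of UTC could report April for it.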

Related

Check if timestamp is inside range

I'm trying to obtain the following:
+--------------------+
|work_time | day_shift|
+--------------------+
| 00:45:40 | No |
| 10:05:47 | Yes |
| 15:25:28 | Yes |
| 19:38:52 | No |
where I classify the "work_time" into "day_shift".
"Yes" - if the time falls between 09:00:00 and 18:00:00
"No" - otherwise
My "work_time" is in datetime format showing only the time. I tried the following, but I'm just getting "No" for everything.
df = df.withColumn('day_shift', when(df.work_time >= to_timestamp(lit('09:00:00'), 'HH:mm:ss') & df.work_time <= to_timestamp(lit('18:00:00'), 'Yes').otherwise('No'))
You can use the Column method between. It works for both timestamps and strings in the format "HH:mm:ss". Use this:
F.col("work_time").between("09:00:00", "18:00:00")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([('00:45:40',), ('10:05:47',), ('15:25:28',), ('19:38:52',)], ['work_time'])
day_shift = F.col("work_time").between("09:00:00", "18:00:00")
df = df.withColumn("day_shift", F.when(day_shift, "Yes").otherwise("No"))
df.show()
# +---------+---------+
# |work_time|day_shift|
# +---------+---------+
# | 00:45:40| No|
# | 10:05:47| Yes|
# | 15:25:28| Yes|
# | 19:38:52| No|
# +---------+---------+
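The reason between works on strings here is that zero-padded "HH:mm:ss" values sort lexicographically in the same order as the times they represent. A plain-Python sketch of the same inclusive check, with no Spark involved (the function name day_shift is only illustrative):

```python
from datetime import time


def day_shift(work_time, start="09:00:00", end="18:00:00"):
    """Return "Yes" if work_time lies in [start, end], comparing the strings directly."""
    return "Yes" if start <= work_time <= end else "No"


samples = ["00:45:40", "10:05:47", "15:25:28", "19:38:52"]
labels = [day_shift(s) for s in samples]

# Same result when the strings are parsed into real time objects first.
parsed = [
    "Yes" if time(9) <= time.fromisoformat(s) <= time(18) else "No"
    for s in samples
]
assert labels == parsed
```

This only holds while every value is zero-padded; a string such as "9:00:00" would no longer compare correctly.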
First of all, Spark doesn't have a so-called "Time" data type; it only supports TimestampType and DateType. Therefore, I believe the work_time in your dataframe is a string.
Secondly, when you check func.to_timestamp(func.lit('09:00:00'), 'HH:mm:ss') in a select statement, it will show:
+--------------------------------+
|to_timestamp(09:00:00, HH:mm:ss)|
+--------------------------------+
|1970-01-01 09:00:00 |
+--------------------------------+
only showing top 1 row
The best way to achieve this is either to split your work_time column into separate hour, minute and second columns and filter on those, or to add a date value to your work_time column before doing any timestamp filtering.

Efficiently update rows of a postgres table from another table in another database based on a condition in a common column

I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (its values are originally the same as those in the column key in df1; it is only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key) because their create_date values have been modified (by the given users).
I would like to update rows (i.e. in concrete terms; create_date and update_date values) of df1 where update_date in df2 is more recent than its original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine
connector = psycopg2.connect(**database_parameters_dictionary)
# creator must be a callable that returns a DBAPI connection
engine = create_engine('postgresql+psycopg2://', creator=lambda: connector)
df1.update(df2) # 1) maybe there is something better to do here?
with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace", # 2) maybe there is also something better to do here?
        index=True
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means "drop+recreate".
As the tables are really large (more than 500'000 entries), I have the feeling that this will involve a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate some custom SQL queries to compare the dates for each row and only take the ones that have really changed? But here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but unfortunately that is not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
BTW, I don't think df1.update(df2) is exactly correct: from my reading, it would update all rows with any differing fields, not just those where update_date in df2 is later than update_date in df1. But it's a moot point if update_date in df2 is only ever more recent than the value in df1.
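One way the approach above could be sketched with pandas and psycopg2 is shown below. The helper names changed_rows and apply_updates are made up for illustration, the sample frames are invented data loosely following the question's tables, and the table name is taken from the question's to_sql call. apply_updates assumes a psycopg2 connection to database A and is not exercised here; the table name is interpolated into the SQL string, so it should never come from untrusted input.

```python
import pandas as pd


def changed_rows(df1, df2):
    """Rows whose update_date in df2 is newer than in df1, with df2's values."""
    merged = df1.merge(df2, left_on="key", right_on="id", suffixes=("_old", "_new"))
    newer = pd.to_datetime(merged["update_date_new"]) > pd.to_datetime(
        merged["update_date_old"]
    )
    return merged.loc[newer, ["key", "create_date_new", "update_date_new"]]


def apply_updates(connection, changes, table="database_table_name"):
    """Issue one UPDATE per changed row on database A, all in one transaction."""
    sql = f"UPDATE {table} SET create_date = %s, update_date = %s WHERE key = %s"
    params = [
        (None if pd.isna(created) else created, updated, int(key))
        for key, created, updated in changes.itertuples(index=False, name=None)
    ]
    with connection:  # psycopg2 commits on success and rolls back on error
        with connection.cursor() as cursor:
            cursor.executemany(sql, params)


# Made-up sample data, loosely following the tables in the question.
df1 = pd.DataFrame({
    "key": [57247, 57249, 57250],
    "create_date": ["1976-07-29", "1992-12-22", None],
    "update_date": ["2018-01-21", "2016-01-31", "2015-01-21"],
})
df2 = pd.DataFrame({
    "id": [57247, 57249, 57250],
    "create_date": ["1976-07-29", "1992-12-24", "2001-07-14"],
    "update_date": ["2018-01-21", "2020-10-11", "2019-11-21"],
    "user": [None, "klm", "ptl"],
})
changes = changed_rows(df1, df2)
print(changes)
```

Only the rows returned by changed_rows are sent to the database, so the amount of work scales with the number of modified records rather than with the table size.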

Spark Dataframe complex ordering

I have an event log dataset, like this:
| patient | timestamp | event_st | extra_info |
| 1 | 1/1/2018 2:30 | urg_admission | x |
| 1 | 1/1/2018 3:00 | urg_discharge | x |
| 1 | 1/1/2018 | hosp_admission | y |
| 1 | 1/10/2018 | hosp_discharge | y |
I want to order all rows by patient and timestamp, but unfortunately, depending on the type of event event_st, the timestamp may be in minutes or days granularity.
So, the solution I would use in C++ would be to define a complex < operator, where I would use event_st as a discriminator when the time granularity differs. For example, with the data shown, events with the hosp_ prefix will always be ordered after events with the urg_ prefix when they fall on the same day.
Is there any equivalent approach using the DataFrame API or other Spark APIs?
Thank you very much.
One option is to first normalize all the timestamps to some standard form, like ddMMYY or epoch time. The simplest way is to use a UDF.
For example, if you convert all the timestamps to epoch time, your code would look like:
def convertTimestamp(timeStamp: String, event_st: String): Long = {
  if (event_st == "urg_admission") {
    ... // Add conversion logic
  }
  if (event_st == "hosp_admission") {
    ... // Add conversion logic
  }
  ...
}
val df = spark.read.json("/path/to/log/dataset") // I am assuming json format
spark.udf.register("convertTimestamp", convertTimestamp _)
df.createOrReplaceTempView("logdataset")
val df_normalized = spark.sql("select logdataset.*, convertTimestamp(timestamp, event_st) as normalized_timestamp from logdataset")
After this, you can use the normalized dataset for subsequent operations.
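The ordering rule from the question can also be expressed as a composite sort key instead of a single normalized number. The plain-Python sketch below only illustrates that idea and involves no Spark. It assumes the timestamps are month/day/year, that urg_ events should precede hosp_ events on the same day, and the names EVENT_RANK and sort_key are hypothetical. In Spark, the same components could presumably be derived as separate columns and passed to orderBy.

```python
from datetime import datetime

# Assumed tie-break order for events that share a calendar day.
EVENT_RANK = {"urg": 0, "hosp": 1}


def sort_key(row):
    """Build (patient, day, event rank, time of day) for one log row."""
    patient, timestamp, event_st = row[0], row[1], row[2]
    # Rows with minute granularity contain a space; day-only rows do not.
    fmt = "%m/%d/%Y %H:%M" if " " in timestamp else "%m/%d/%Y"
    parsed = datetime.strptime(timestamp, fmt)
    rank = EVENT_RANK[event_st.split("_", 1)[0]]
    return (patient, parsed.date(), rank, parsed.time())


rows = [
    (1, "1/10/2018", "hosp_discharge", "y"),
    (1, "1/1/2018", "hosp_admission", "y"),
    (1, "1/1/2018 3:00", "urg_discharge", "x"),
    (1, "1/1/2018 2:30", "urg_admission", "x"),
]
ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row)
```

Putting the day before the event rank means the prefix only decides the order among rows of the same day, which is what the C++-style comparison in the question describes.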

Find and remove matching column values in pyspark

I have a pyspark dataframe where occasionally the columns will have a wrong value that matches another column. It would look something like this:
| Date | Latitude |
| 2017-01-01 | 43.4553 |
| 2017-01-02 | 42.9399 |
| 2017-01-03 | 43.0091 |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows that are like this. I thought about using .isin() but I can't seem to get it to work. If I try
df['Date'].isin(['Latitude'])
I get:
Column<(Date IN (Latitude))>
Any suggestions?
If you're more comfortable with SQL syntax, here is an alternative way using a pyspark-sql condition inside the filter():
df = df.filter("Date NOT IN (Latitude)")
Or equivalently using pyspark.sql.DataFrame.where():
df = df.where("Date NOT IN (Latitude)")

Performance: Group by a subset of previous grouping columns

I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists of grouping only by Cat A, to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume it is more efficient than aggregating the first DF again. Is that a good assumption, or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking, it looks like a good approach and should be more efficient than aggregating the data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once. So when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce memory requirements for the aggregation buffer during the second agg.
Unfortunately, DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means that you cannot compute both data frames using a single shuffle based on the value of catA, so the second aggregation will require a separate exchange with hash partitioning. I doubt this justifies switching to RDDs.
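Reusing the first aggregation is only valid when the aggregate can be rebuilt from partial results, as counts can by summing them. A plain-Python sketch of that roll-up on the question's sample rows (no Spark involved, variable names are illustrative):

```python
from collections import Counter

# (ID, Cat A, Cat B) rows from the question's example.
rows = [(1, "A", "B"), (2, "B", "C"), (5, "A", "B"), (7, "B", "C"), (8, "A", "C")]

# Step 1: count per (Cat A, Cat B) pair, as df.groupBy("catA", "catB") would.
by_pair = Counter((cat_a, cat_b) for _, cat_a, cat_b in rows)

# Step 2: roll the partial counts up to Cat A by summing them.
by_cat_a = Counter()
for (cat_a, _), cnt in by_pair.items():
    by_cat_a[cat_a] += cnt

# Aggregating the raw rows directly gives the same totals.
direct = Counter(cat_a for _, cat_a, _ in rows)
assert by_cat_a == direct
print(dict(by_cat_a))
```

The same holds for sums, minimums and maximums, but not for aggregates such as distinct counts or medians, which would have to be computed from the original rows.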
