I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is by date: created_year={}/created_month={}/created_day={}. To avoid too many objects per partition, I do the following to maintain a single object per partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete at the following location:')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add rows that have these partition column values:
| created_year | created_month | created_day |
| ------------ | ------------- | ----------- |
| 2019 | 10 | 27 |
This means that the file (S3 object) at this path: created_year=2019/created_month=10/created_day=27/some_random_name.parquet needs the new rows appended to it.
If there is a change in the schema, then all the objects will have to implement that change.
I looked into how this generally works, and there are two save modes of interest: overwrite and append.
The first will write only the current data and delete everything else; I do not want that. The second will append, but may end up creating more objects; I do not want that either. I also read that DataFrames are immutable in Spark.
So, how do I achieve appending the new data as it arrives to existing partitions and maintaining one object per day?
Based on your question I understand that you need to add new rows to the existing data while not increasing the number of parquet files. This can be achieved by operating on specific partition folders. There are three cases to handle.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You can use the "append mode" but you want a single parquet file in each partition folder. That's why you should read the existing partition first, union it with the new data, then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.unionByName(new_data)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
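A caution worth adding here, as a hedged sketch rather than a drop-in solution: Spark evaluates lazily, so writing back to the very folder you just read from can fail or lose data unless the old data is materialized first, and the write needs overwrite mode because the folder already exists. One common workaround (not bulletproof; writing to a temporary path and swapping it in is safer) looks like this:
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path).cache()
old_data.count()  # force materialization before the source files are replaced
write_data = old_data.unionByName(new_data)
write_data.repartition(1).write.mode("overwrite").parquet(partition_path)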
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
from pyspark.sql import functions as F

partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
    "year", "month", "day", "key",
    F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
When we outer join the old data with the new data, there are four possibilities:
- both sides have the same value, so it doesn't matter which one we take
- the two sides have different values, so take the new value
- the old data doesn't have the value and the new data does, so take the new one
- the new data doesn't have the value and the old data does, so take the old one
To get exactly this behaviour, coalesce from pyspark.sql.functions does the work.
Note that this solution covers the second case as well.
About schema change
Spark supports schema merging for the parquet file format. This means you can add columns to, or remove columns from, your data. As you add or remove columns, you will notice that some columns are not present when reading the data from the top level. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)
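If you need this for every read, it can also be enabled globally through the Spark configuration key spark.sql.parquet.mergeSchema; a minimal sketch (the path variable is just a placeholder):
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
df = spark.read.parquet(path)  # schemas are now merged without the per-read option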
Related
I have two pandas DataFrames:
df1 from database A with connection parameters {"host":"hostname_a","port": "5432", "dbname":"database_a", "user": "user_a", "password": "secret_a"}. The column key is the primary key.
df1:
| | key | create_date | update_date |
|---:|------:|:-------------|:--------------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 |
| 1 | 57248 | | 2018-01-21 |
| 2 | 57249 | 1992-12-22 | 2016-01-31 |
| 3 | 57250 | | 2015-01-21 |
| 4 | 57251 | 1991-12-23 | 2015-01-21 |
| 5 | 57262 | | 2015-01-21 |
| 6 | 57263 | | 2014-01-21 |
df2 from database B with connection parameters {"host": "hostname_b","port": "5433", "dbname":"database_b", "user": "user_b", "password": "secret_b"}. The column id is the primary key (these values are originally the same as the ones in the column key in df1; it's only a renaming of the primary key column of df1).
df2:
| | id | create_date | update_date | user |
|---:|------:|:-------------|:--------------|:------|
| 0 | 57247 | 1976-07-29 | 2018-01-21 | |
| 1 | 57248 | | 2018-01-21 | |
| 2 | 57249 | 1992-12-24 | 2020-10-11 | klm |
| 3 | 57250 | 2001-07-14 | 2019-21-11 | ptl |
| 4 | 57251 | 1991-12-23 | 2015-01-21 | |
| 5 | 57262 | | 2015-01-21 | |
| 6 | 57263 | | 2014-01-21 | |
Notice that row[2] and row[3] in df2 have more recent update_date values (2020-10-11 and 2019-21-11 respectively) than their counterparts in df1 (where id = key), because their create_date values have been modified (by the given users).
I would like to update the rows of df1 (concretely, the create_date and update_date values) where update_date in df2 is more recent than the original value in df1 (for the same primary keys).
This is how I'm tackling this for the moment, using sqlalchemy and psycopg2 + the .to_sql() method of pandas' DataFrame:
import psycopg2
from sqlalchemy import create_engine

# creator expects a callable that returns a new DBAPI connection
connector = lambda: psycopg2.connect(**database_parameters_dictionary)
engine = create_engine('postgresql+psycopg2://', creator=connector)

df1.update(df2)  # 1) maybe there is something better to do here?

with engine.connect() as connection:
    df1.to_sql(
        name="database_table_name",
        con=connection,
        schema="public",
        if_exists="replace",  # 2) maybe there is also something better to do here?
        index=True,
    )
The problem I have is that, according to the documentation, the if_exists argument can only do three things:
if_exists{‘fail’, ‘replace’, ‘append’}, default ‘fail’
Therefore, to update these two rows, I have to:
1) use the .update() method on df1 with df2 as an argument, together with
2) replacing the whole table inside the .to_sql() method, which means drop + recreate.
As the tables are really large (more than 500'000 entries), I have the feeling that this involves a lot of unnecessary work!
How could I efficiently update only those two newly updated rows? Do I have to generate custom SQL queries that compare the dates for each row and only take the ones that have really changed? But here again, I have the intuition that looping through all rows to compare the update dates will take "a lot" of time. What is the most efficient way to do that? (It would have been easier in pure SQL if the two tables were on the same host/database, but that's unfortunately not the case.)
Pandas can't do partial updates of a table, no. There is a longstanding open bug for supporting sub-whole-table-granularity updates in .to_sql(), but you can see from the discussion there that it's a very complex feature to support in the general case.
However, limiting it to just your situation, I think there's a reasonable approach you could take.
Instead of using df1.update(df2), put together an expression that yields only the changed records with their new values (I don't use pandas often so I don't know this offhand); then iterate over the resulting dataframe and build the UPDATE statements yourself (or with the SQLAlchemy expression layer, if you're using that). Then, use the connection to DB A to issue all the UPDATEs as one transaction. With an indexed PK, it should be as fast as this would ever be expected to be.
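A minimal sketch of that approach, assuming the table and column names from the question (database_table_name, key, create_date, update_date) and that the update_date columns are already a comparable datetime dtype; the merge step and the connection string are my assumptions, not the asker's or answerer's exact code:
from sqlalchemy import create_engine, text

# Keep only the rows where df2 is strictly more recent than df1
# (assumes update_date is comparable, e.g. already parsed as datetimes).
merged = df1.merge(df2, left_on="key", right_on="id", suffixes=("_old", "_new"))
changed = merged[merged["update_date_new"] > merged["update_date_old"]]

engine = create_engine("postgresql+psycopg2://user_a:secret_a@hostname_a:5432/database_a")
update_stmt = text(
    "UPDATE database_table_name "
    "SET create_date = :create_date, update_date = :update_date "
    "WHERE key = :key"
)

# Issue all UPDATEs inside a single transaction; with an index on the primary key
# this is about as fast as this can reasonably get.
with engine.begin() as connection:
    connection.execute(
        update_stmt,
        [
            {
                "key": row.key,
                "create_date": row.create_date_new,
                "update_date": row.update_date_new,
            }
            for row in changed.itertuples(index=False)
        ],
    )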
BTW, I don't think df1.update(df2) is exactly correct - from my reading, that would update all rows with any differing fields, not just those where update_date is more recent than the previous update_date. But it's a moot point if update_date in df2 is only ever more recent than in df1.
All,
I am trying to read a file with multiple record types in Spark, but have no clue how to do it. Can someone point out if there is a way to do it, or an existing package, or a user GitHub package?
The example below has a text file with 2 separate record types (it could be more than 2):
00X - record_ind | First_name | Last_name
0-3 record_ind
4-10 firstname
11-16 lastname
============================
00Y - record_ind | Account_# | STATE | country
0-3 record_ind
4-8 Account #
9-10 STATE
11-15 country
input.txt
------------
00XAtun Varma
00Y00235ILUSA
00XDivya Reddy
00Y00234FLCANDA
sample output/data frame
output.txt
record_ind | x_First_name | x_Last_name | y_Account | y_STATE | y_country
---------------------------------------------------------------------------
00x | Atun | Varma | null | null | null
00y | null | null | 00235 | IL | USA
00x | Divya | Reddy | null | null | null
00y | null | null | 00234 | FL | CANDA
One way to achieve this is to load the data as 'text'. Each complete row will be loaded into a single column named 'value'. Then call a UDF which modifies each row based on a condition and transforms the data so that all rows follow the same schema.
Finally, use the schema to create the required dataframe and save it to the database.
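As a rough sketch of that idea (using substring/when expressions instead of a UDF, which achieves the same reshaping; the column offsets below are guesses from the sample layout and will need adjusting):
from pyspark.sql import functions as F

raw = spark.read.text("input.txt")  # one column named "value" per line

rec = F.substring("value", 1, 3)

parsed = raw.select(
    rec.alias("record_ind"),
    F.when(rec == "00X", F.trim(F.substring("value", 4, 7))).alias("x_First_name"),
    F.when(rec == "00X", F.trim(F.substring("value", 11, 6))).alias("x_Last_name"),
    F.when(rec == "00Y", F.substring("value", 4, 5)).alias("y_Account"),
    F.when(rec == "00Y", F.substring("value", 9, 2)).alias("y_STATE"),
    F.when(rec == "00Y", F.substring("value", 11, 5)).alias("y_country"),
)
# Columns belonging to the other record type come out as null, matching the desired output above.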
I have multiple monthly datasets with 50 variables each. I need to append these datasets to create one single dataset. However, I also want to add the month's name to the corresponding records while appending such that I can see a new column in the final dataset which can be used to identify records belonging to a month.
Example:
Data 1: Monthly_file_201807
| ID | customerCategory | Amount |
| -- | ---------------- | ------- |
| 1 | home | 654.00 |
| 2 | corporate | 9684.65 |
Data 2: Monthly_file_201808
| ID | customerCategory | Amount |
| -- | ---------------- | ------ |
| 84 | SME | 985.29 |
| 25 | Govt | 844.88 |
On Appending, I want something like this:
| ID | customerCategory | Amount | Month |
| -- | ---------------- | ------- | ------ |
| 1 | home | 654.00 | 201807 |
| 2 | corporate | 9684.65 | 201807 |
| 84 | SME | 985.29 | 201808 |
| 25 | Govt | 844.88 | 201808 |
Currently, I'm appending using the following code:
List dsList = [
    Data1Path,
    Data2Path
].collect() { app.data.open(source: it) }

// concatenate all records into a single larger dataset
Dataset ds = app.data.create()
dsList.each() {
    ds.prepareToAdd(it)
    ds.addAll(it)
}
ds.save()
app.data.copy(in: ds, out: FinalAppendedDataPath)
I have used the standard append code, but I am unable to add the additional column with a fixed month value. I don't want to loop through the data row by row to create a "month" column, as my data is very large and I have multiple files.
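The tool used in the snippet above isn't one I can sketch reliably, so purely as a generic illustration of the idea in pandas (file names and month values are assumptions): assign the month as a constant column per source file before concatenating, which is vectorised and needs no per-row loop.
import pandas as pd

sources = {"Monthly_file_201807.csv": "201807", "Monthly_file_201808.csv": "201808"}

frames = []
for path, month in sources.items():
    monthly = pd.read_csv(path)
    monthly["Month"] = month  # constant column for every record from this file
    frames.append(monthly)

final = pd.concat(frames, ignore_index=True)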
I have a data frame as shown below and want to insert this data into a Cassandra table.
+---------+------+-----------+
| name | id | city |
+---------+------+-----------+
| sam | 123 | Atlanta |
| John | 456 | Texas |
+---------+------+-----------+
I am using the code below, but it inserts only the last row.
df.write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "tablename", "keyspace" -> "keyspace"))
  .mode(saveMode = "Append")
  .save()
How to insert a data frame into Cassandra in Scala?
The code you provided should insert all the rows into Cassandra. There are a few reasons that it might not.
- That is actually all the data there is: df for some reason only contains a single row.
- There are multiple rows but they share the same partition key, which means that subsequent writes will overwrite the initial write.
- An exception is being thrown; this should be obvious in the logs.
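A quick, hedged way to check the first two possibilities (shown in PySpark syntax, but the same method calls exist on the Scala DataFrame; "name" and "id" are taken from the question's example and are only assumed to be the table's primary key):
df.count()                                  # is there really more than one row?
df.select("name", "id").distinct().count()  # how many distinct primary-key combinations survive?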
I am using Spark 2.1 with Cassandra (3.9) as the data source. C* has a big table with 50 columns, which is not a good data model for my use case, so I created split tables for each of those sensors along with partition key and clustering key columns.
All sensor table
-----------------------------------------------------
| Device | Time | Sensor1 | Sensor2 | Sensor3 |
| dev1 | 1507436000 | 50.3 | 1 | 1 |
| dev2 | 1507436100 | 90.2 | 0 | 1 |
| dev1 | 1507436100 | 28.1 | 1 | 1 |
-----------------------------------------------------
Sensor1 table
-------------------------------
| Device | Time | value |
| dev1 | 1507436000 | 50.3 |
| dev2 | 1507436100 | 90.2 |
| dev1 | 1507436100 | 28.1 |
-------------------------------
Now I am using spark to copy data from old table to new ones.
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="allsensortables", keyspace="dataks")\
.load().cache()
df.createOrReplaceTempView("data")
query = ('''select device,time,sensor1 as value from data ''' )
vgDF = spark.sql(query)
vgDF.write\
.format("org.apache.spark.sql.cassandra")\
.mode('append')\
.options(table="sensor1", keyspace="dataks")\
.save()
Copying the data one by one is taking a lot of time (2.1 hours) for a single table. Is there any way I can select * once, create multiple DataFrames for each sensor, and save them at once (or even sequentially)?
One issue in the code is the cache
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="allsensortables", keyspace="dataks")\
.load().cache()
Here I don't see df being used multiple times apart from the save, so the cache is counterproductive. You are reading the data, filtering it, and saving it to a separate Cassandra table. The only action happening on the dataframe is the save and nothing else.
So there is no benefit from caching the data here. Removing the cache will give you some speedup.
To create the multiple tables sequentially, I would suggest using partitionBy to first write the data to HDFS, partitioned by sensor, and then write it back to Cassandra, as sketched below.
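A rough sketch of that suggestion in PySpark; the table, keyspace and column names come from the question, while the staging path, the stack() unpivot and the assumption that the sensor columns share a compatible type are mine:
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load()  # no cache: the data is only passed through once

# Unpivot the sensor columns into (sensor, value) pairs so the data can be partitioned by sensor.
long_df = df.selectExpr(
    "device", "time",
    "stack(3, 'sensor1', sensor1, 'sensor2', sensor2, 'sensor3', sensor3) as (sensor, value)"
)

# One pass over Cassandra, one partitioned copy on HDFS.
long_df.write.partitionBy("sensor").parquet("hdfs:///tmp/sensors_staging")

# Each sensor table is then filled from its own partition folder.
for sensor in ["sensor1", "sensor2", "sensor3"]:
    spark.read.parquet("hdfs:///tmp/sensors_staging/sensor={}".format(sensor)) \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .mode("append") \
        .options(table=sensor, keyspace="dataks") \
        .save()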