Efficient way to partially update a record in Dataframe - apache-spark

I have a system which accumulates batch data in a snapshot.
Each record in a batch contains a unique_id, a version, and multiple other columns.
Previously, whenever a unique_id arrived in a new batch with a version higher than the version present in the snapshot, the system replaced the entire record and rewrote it as a new record. This is essentially a merge of two DataFrames based on the version.
For example :
Snapshot: <Uid> <Version> <col1> <col2>
-----------------
A1 | 1 | ab | cd
A2 | 1 | ef | gh
New Batch: <Uid> <Version> <col1>
------------------
A3 | 1 | gh
A1 | 2 | hh
Note that col2 is absent in the new batch.
After the merge it becomes:
<Uid> <Version> <col1> <col2>
------------------
A3 | 1 | gh | Null
A1 | 2 | hh | Null
A2 | 1 | ef | gh
The problem is that even though no data arrived for col2 for Uid A1, after the merge that column is replaced by a null value, so the older value of the column is lost.
Now I want to replace only the columns for which data have actually arrived,
i.e. the expected output is:
<Uid> <Version> <col1> <col2>
------------------
A3 | 1 | gh | Null
A1 | 2 | hh | cd
A2 | 1 | ef | gh
Note that for unique id A1 the col2 value is kept intact.
However, if the batch has the record for A1 as:
New Batch: <Uid> <Version> <col1> <col2>
------------------
A1 | 2 | hh | uu
The output will be
------------------
A1 | 2 | hh | uu
A2 | 1 | ef | gh
Here the entire record of A1 is replaced, since the batch supplied values for all of its columns.
In the current system I am using Spark and storing the data as Parquet. I can tweak the merge process to incorporate this change.
However, I would like to know whether this is an optimal way to store data for this use case.
I am evaluating HBase and Hive ORC, along with possible changes I can make to the merge process.
Any suggestion will be highly appreciated.

As far as I understand, you need to use a full outer join between the snapshot and the journal (delta) and then use coalesce, for instance:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, when}

def applyDeduplicatedJournal(snapshot: DataFrame, journal: DataFrame, joinColumnNames: Seq[String]): DataFrame = {
  // join the snapshot and the journal on all key columns
  val joinExpr = joinColumnNames
    .map(column => snapshot(column) === journal(column))
    .reduceLeft(_ && _)

  // a snapshot row has no matching journal row when all journal keys are null
  val isThereNoJournalRecord = joinColumnNames
    .map(jCol => journal(jCol).isNull)
    .reduceLeft(_ && _)

  // keep the snapshot value where no journal row exists,
  // otherwise prefer the journal value and fall back to the snapshot value
  val selectClause = snapshot.columns
    .map(col => when(isThereNoJournalRecord, snapshot(col)).otherwise(coalesce(journal(col), snapshot(col))) as col)

  snapshot
    .join(journal, joinExpr, "full_outer")
    .select(selectClause: _*)
}
In this case you merge the snapshot with the journal, falling back to the snapshot value whenever the journal has a null value.
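For reference, here is a rough PySpark sketch of the same idea (untested; it assumes the version filtering already happened upstream, and that the batch may be missing some snapshot columns such as col2):

from functools import reduce
from pyspark.sql import functions as F

def apply_journal(snapshot, journal, join_cols):
    # join the snapshot and the batch on all key columns
    join_expr = reduce(lambda a, b: a & b,
                       [snapshot[c] == journal[c] for c in join_cols])

    # a snapshot row has no matching batch row when all batch keys are null
    no_journal = reduce(lambda a, b: a & b,
                        [journal[c].isNull() for c in join_cols])

    # keep the snapshot value where no batch row exists; otherwise prefer the
    # batch value and fall back to the snapshot value (columns absent from the
    # batch, e.g. col2, always fall back to the snapshot)
    select_cols = [
        F.when(no_journal, snapshot[c])
         .otherwise(F.coalesce(journal[c] if c in journal.columns else F.lit(None),
                               snapshot[c]))
         .alias(c)
        for c in snapshot.columns
    ]

    return snapshot.join(journal, join_expr, "full_outer").select(*select_cols)

# hypothetical usage: merged = apply_journal(snapshot_df, batch_df, ["Uid"])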
Hope it helps!

Related

Fixed-length file reading in Spark with multiple record formats in one file

All,
I am trying to read a file with multiple record types in Spark, but have no clue how to do it. Can someone point out if there is a way to do it, or some existing packages, or some user GitHub packages?
In the example below we have a text file with 2 separate record types (it could be more than 2):
00X - record_ind | First_name| Last_name
0-3 record_ind
4-10 firstname
11-16 lastname
============================
00Y - record_ind | Account_#| STATE | country
0-3 record_ind
4-8 Account #
9-10 STATE
11-15 country
input.txt
------------
00XAtun Varma
00Y00235ILUSA
00XDivya Reddy
00Y00234FLCANDA
sample output/data frame
output.txt
record_ind | x_First_name | x_Last_name | y_Account | y_STATE | y_country
---------------------------------------------------------------------------
00x | Atun | Varma | null | null | null
00y | null | null | 00235 | IL | USA
00x | Divya | Reddy | null | null | null
00y | null | null | 00234 | FL | CANDA
One way to achieve this is to load the data as 'text'. The complete row will be loaded into one column named 'value'. Then call a UDF which parses each row based on the record indicator and transforms the data so that all rows follow the same schema.
Finally, use that schema to create the required dataframe and save it to the database.
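For illustration, a minimal PySpark sketch of that approach might look like the following (the fixed-width offsets are taken from the layout in the question and may need adjusting for the real file; the UDF and column names are placeholders):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# each line of the file lands in a single column named "value"
raw = spark.read.text("input.txt")

unified_schema = StructType([
    StructField("record_ind", StringType()),
    StructField("x_First_name", StringType()),
    StructField("x_Last_name", StringType()),
    StructField("y_Account", StringType()),
    StructField("y_STATE", StringType()),
    StructField("y_country", StringType()),
])

@F.udf(returnType=unified_schema)
def parse_line(line):
    ind = line[0:3]
    if ind == "00X":
        # name record: record_ind, first name, last name
        return (ind, line[3:10].strip(), line[10:16].strip(), None, None, None)
    # otherwise assume an account record: record_ind, account #, state, country
    return (ind, None, None, line[3:8], line[8:10], line[10:15])

df = raw.select(parse_line("value").alias("rec")).select("rec.*")
df.show()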

How to add rows to an existing partition in Spark?

I have to update historical data. By update, I mean adding new rows and sometimes new columns to an existing partition on S3.
The current partitioning is implemented by date: created_year={}/created_month={}/created_day={}. In order to avoid too many objects per partition, I do the following to maintain a single object per partition:
def save_repartitioned_dataframe(bucket_name, df):
    dest_path = form_path_string(bucket_name, repartitioned_data=True)
    print('Trying to save repartitioned data at: {}'.format(dest_path))
    df.repartition(1, "created_year", "created_month", "created_day").write.partitionBy(
        "created_year", "created_month", "created_day").parquet(dest_path)
    print('Data repartitioning complete with at the following location: ')
    print(dest_path)
    _, count, distinct_count, num_partitions = read_dataframe_from_bucket(bucket_name, repartitioned_data=True)
    return count, distinct_count, num_partitions
A scenario exists where I have to add certain rows that have these columnar values:
created_year | created_month | created_day
2019 | 10 | 27
This means that the file(S3 object) at this path: created_year=2019/created_month=10/created_day=27/some_random_name.parquet will be appended with the new rows.
If there is a change in the schema, then all the objects will have to implement that change.
I tried looking into how this works generally, so, there are two modes of interest: overwrite, append.
The first one will just add the current data and delete the rest. I do not want that situation. The second one will append but may end up creating more objects. I do not want that situation either. I also read that dataframes are immutable in Spark.
So, how do I append the new data as it arrives to existing partitions while maintaining one object per day?
Based on your question I understand that you need to add new rows to the existing data while not increasing the number of parquet files. This can be achieved by doing operations on specific partition folders. There are three cases to consider.
1) New partition
This means the incoming data has a new value in the partition columns. In your case, this can be like:
Existing data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 1 |
New data
| year | month | day |
| ---- | ----- | --- |
| 2020 | 1 | 2 |
So, in this case, you can just create a new partition folder for the incoming data and save it as you did.
partition_path = "/path/to/data/year=2020/month=1/day=2"
new_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
2) Existing partition, new data
This is where you want to append new rows to the existing data. It could be like:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | b | 1 |
Here we have a new record for the same partition. You can use the "append mode" but you want a single parquet file in each partition folder. That's why you should read the existing partition first, union it with the new data, then write it back.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.unionByName(new_data)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
3) Existing partition, existing data
What if the incoming data is an UPDATE, rather than an INSERT? In this case, you should update a row instead of inserting a new one. Imagine this:
Existing data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 1 |
New data
| year | month | day | key | value |
| ---- | ----- | --- | --- | ----- |
| 2020 | 1 | 1 | a | 2 |
"a" had a value of 1 before, now we want it to be 2. So, in this case, you should read existing data and update existing records. This could be achieved like the following.
partition_path = "/path/to/data/year=2020/month=1/day=1"
old_data = spark.read.parquet(partition_path)
write_data = old_data.join(new_data, ["year", "month", "day", "key"], "outer")
write_data = write_data.select(
"year", "month", "day", "key",
F.coalesce(new_data["value"], old_data["value"]).alias("value")
)
write_data.repartition(1, "year", "month", "day").write.parquet(partition_path)
When we outer join the old data with the new data, there can be four outcomes:
- both sides have the same value, so it doesn't matter which one we take
- the two sides have different values, so take the new value
- the old data doesn't have the value but the new data does, so take the new one
- the new data doesn't have the value but the old data does, so take the old one
To get the behaviour we want here, coalesce from pyspark.sql.functions does the work.
Note that this solution covers the second case as well.
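As a quick illustration of that coalesce behaviour on a toy example (the data here is made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

old = spark.createDataFrame([("a", 1), ("b", 5)], ["key", "value"])
new = spark.createDataFrame([("a", 2), ("c", 7)], ["key", "value"])

joined = old.join(new, ["key"], "outer")
joined.select(
    "key",
    F.coalesce(new["value"], old["value"]).alias("value")
).show()
# key a -> 2 (new value wins), b -> 5 (only old exists), c -> 7 (only new exists)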
About schema change
Spark supports schema merging for the Parquet file format. This means you can add columns to or remove columns from your data. As you add or remove columns, you will notice that some columns are not present when reading the data from the top level. This is because Spark disables schema merging by default. From the documentation:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
To be able to read all columns, you need to set the mergeSchema option to true.
df = spark.read.option("mergeSchema", "true").parquet(path)

Add columns based on a changing number of rows below

I'm trying to solve a machine learning problem for a university project. As input I got an Excel table.
I need to access the information below specific rows (condition: df[c1] != 0) and create new columns from it. But the number of rows below each such row is not fixed.
I tried various pandas approaches (e.g. while loops combined with iloc, iterrows), but nothing seemed to work. Now I wonder if I need to create a function that builds a new df for every group below each top element. I assume there must be a better option. I use Python 3.6 and Pandas 0.25.0.
I am trying to get the following result.
Input:
| name | c1 | c2 |
|------|-------|--------------|
| ab | 1 | info |
| tz | 0 | more info |
| ka | 0 | more info |
| cd | 2 | info |
| zz | 0 | more info |
The output should look like this:
Output:
| name | c1 | c2 | tz3 | ka4 | zz5 |
|------|-------|--------------|-----------|-----------|------------|
| ab | 1 | info | more info | more info | |
| tz | 0 | more info | | | |
| ka | 0 | more info | | | |
| cd | 2 | info | | | more info |
| zz | 0 | more info | | | |
You can do this as follows:
# make sure c1 is of type int (if it isn't already)
# if it is string, just change the comparison further below
df['c1']= df['c1'].astype('int32')
# create two temporary aux columns in the original dataframe
# the first contains 1 for each row where c1 is nonzero
df['nonzero']= (df['c1'] != 0).astype('int')
# the second contains a "group index" to give
# all rows that belong together the same number
df['group']= df['nonzero'].cumsum()
# create a working copy from the original dataframe
df2= df[['c1', 'c2', 'group']].copy()
# add another column which contains the name of the
# column under which the text should appear
df2['col']= df['name'].where(df['nonzero']==0, 'c2')
# add a dummy column with all ones
# (needed to merge the original dataframe
# with the "transposed" dataframe later)
df2['nonzero']= 1
# now the main part
# use the prepared copy and index it on
# group, nonzero(1) and col
df3= df2[['group', 'nonzero', 'col', 'c2']].set_index(['group', 'nonzero', 'col'])
# unstack it, meaning col is "split off" to create a new column
# level (like pivoting), the rest remains in the index
df3= df3.unstack()
# now df3 has a multilevel column index
# to get rid of it and have regular column names
# just rename the columns and remove c2 which
# we get from the original dataframe
df3_names= ['{1}'.format(*tup) for tup in df3.columns]
df3.columns= df3_names
df3.drop(['c2'], axis='columns', inplace=True)
# df3 now contains the "transposed" infos in column c1
# which should appear in the row for which 'nonzero' contains 1
# to get this, use merge
result= df.merge(df3, left_on=['group', 'nonzero'], right_index=True, how='left')
# if you don't like the NaN values (for the rows with nonzero=0), use fillna
result.fillna('', inplace=True)
# remove the aux columns and the merged c2_1 column
# for c2_1 we can use the original c2 column from df
result.drop(['group', 'nonzero'], axis='columns', inplace=True)
# therefore we rename it to get the same naming schema
result.rename({'c2': 'c2_1'}, axis='columns', inplace=True)
The result looks like this:
|   | name | c1 | c2             | ka             | tz        | zz        |
|---|------|----|----------------|----------------|-----------|-----------|
| 0 | ab   | 1  | info           | even more info | more info |           |
| 1 | tz   | 0  | more info      |                |           |           |
| 2 | ka   | 0  | even more info |                |           |           |
| 3 | cd   | 2  | info           |                |           | more info |
| 4 | zz   | 0  | more info      |                |           |           |
For this input data:
|   | name | c1 | c2             |
|---|------|----|----------------|
| 0 | ab   | 1  | info           |
| 1 | tz   | 0  | more info      |
| 2 | ka   | 0  | even more info |
| 3 | cd   | 2  | info           |
| 4 | zz   | 0  | more info      |
# created by the following code:
import io
import pandas as pd
raw=""" name c1 c2
0 ab 1 info
1 tz 0 more_info
2 ka 0 even_more_info
3 cd 2 info
4 zz 0 more_info"""
df= pd.read_csv(io.StringIO(raw), sep='\s+', index_col=0)
df['c2']=df['c2'].str.replace('_', ' ')

Conditional Inner join in sqlite python

I have three tables a, b and c.
Table a is related to table b through the column key.
Table b is related to table c through the columns word, sense and speech. In addition, table c holds a column id.
Now some rows in a.word have no matching value in b.word. Based on that,
I want to inner join the tables on the condition: if a.word = b.word then join on that, otherwise compare only a.end_key = b.key.
As a result I want a table in the form of a, with extra columns start_id and end_id from c matching key_start and key_end.
I tried the following SQL command from Python:
CREATE TABLE relations AS
SELECT *
FROM c
INNER JOIN a
INNER JOIN b
ON a.end_key = b.key
AND a.start_key = b.key
AND b.word = c.word
AND b.speech = c.speech
AND b.sense = c.sense
OR a.word = b.word
a:
+-----------+---------+------+-----------+
| key_start | key_end | word | relation |
+-----------+---------+------+-----------+
| k5 | k1 | tree | h |
| k7 | k2 | car | m |
| k200 | k3 | bad | ho |
+-----------+---------+------+-----------+
b:
+-----+------+--------+-------+
| key | word | speech | sense |
+-----+------+--------+-------+
| k5 | sky | a | 1 |
| k2 | car | a | 1 |
| k3 | bad | n | 2 |
+-----+------+--------+-------+
c:
+----+---------+--------+-------+
| id | word | speech | sense |
+----+---------+--------+-------+
| 0 | light | a | 1 |
| 0 | dark | b | 3 |
| 1 | neutral | a | 2 |
+----+---------+--------+-------+
Edit for clarification:
Tables a, b and c hold hundreds of thousands of rows, so there are matching values across the tables. Table a is related to table b via the end_key ~ key and start_key ~ key relations. Table b is related to c through word, sense and speech; there are values which match in each of these columns.
The desired table is of the form
start_id | key_start | key_end | end_id | relation
where start_id matches key_start and end_id matches key_end.
EDIT: new answer
The problem with the proposed query lies in the use of ANDs and ORs (and likely missing parentheses). This requirement
a.word = b.word then join, otherwise compare only a.end_key = b.key
would translate to:
AND (a.word = b.word OR a.end_key = b.key)
Maybe try it like this:
ON b.word = c.word
AND b.speech = c.speech
AND b.sense = c.sense
AND (a.word = b.word OR a.end_key = b.key)
It would be a good idea to test in a sqlite manager (e.g. the sqlite3 command line or DB Browser for SQLite) before you try it in Python; troubleshooting is much easier. And of course test the SELECT before you wrap it in a CREATE TABLE.
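For instance, a minimal sketch of running a corrected statement from Python's built-in sqlite3 module could look like this (the database path is an assumption, and the selected columns are only a guess at what you need; getting both start_id and end_id would likely need a second join, which is part of what needs clarifying below):

import sqlite3

conn = sqlite3.connect("mydb.sqlite")   # path is an assumption
cur = conn.cursor()

cur.execute("""
    CREATE TABLE relations AS
    SELECT a.*, c.id AS end_id
    FROM a
    INNER JOIN b
        ON (a.word = b.word OR a.end_key = b.key)
    INNER JOIN c
        ON  b.word   = c.word
        AND b.speech = c.speech
        AND b.sense  = c.sense
""")

conn.commit()
conn.close()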
You could clarify your question by showing the desired columns and the result in the relations table that this sample data would create (there is nothing between b and c that would match on word, speech, sense). Also, the description of the relationship between a and b is confusing: the first paragraph says table a is related to table b through column key. Should key be word?

Display all matching values in one comma separated cell

I have two columns of data in an Excel 2010 spreadsheet. In Column A is a category, and in Column B is a value. There will be multiple values in Column B for each unique category in Column A.
What I want to achieve in a separate sheet is to display all of the values for each unique category in one comma- (or semicolon-, etc.) separated cell.
For example, if my first sheet looks like this:
----------------------
| Category | Value |
----------------------
| Cat1 | Val A |
| Cat1 | Val B |
| Cat1 | Val C |
| Cat2 | Val D |
| Cat3 | Val E |
| Cat3 | Val F |
| Cat3 | Val G |
| Cat3 | Val H |
----------------------
I'd want to display the following in another sheet:
---------------------------------------
| Category | Value |
---------------------------------------
| Cat1 | Val A,Val B,Val C |
| Cat2 | Val D |
| Cat3 | Val E,Val F,Val G, Val H |
---------------------------------------
Can this be achieved with a formula? Vlookup will only find the first matching value, of course. I've Googled it, but the individual search terms involved in the query are so generic I'm getting swamped with inappropriate results.
Please try (in a copy on another sheet):
1. Insert a column on the left with =IF(B2<>B3,"","x") in A2 (assuming Category is in B1).
2. In D2 put =IF(B1=B2,D1&", "&C2,C2) and copy both formulae down to suit.
3. Copy and Paste Special Values over the top.
4. Filter ColumnA for x and delete the selected rows.
5. Unfilter and delete ColumnA.
