I have a 2-column dataframe such as:
col1 | col2
-----------
a1   | b1
a2   | b1
a3   | b2
a1   | b2
a1   | b3
I partition this dataframe using random number generation:
df = df.withColumn("part", (rand() * num_partitions).cast("int"))
df.write.partitionBy("part").mode("overwrite").parquet("/address/")
However, with this partitioning there is no guarantee that all rows where col1=a1 end up in the same partition. Is there any way to get this guarantee while partitioning the dataframe?
You can repartition the dataset on part, e.g. repartition(num_partitions, "part"); this reduces the skew along your partition column col1. Then, while writing, specify col1 in the partitionBy expression; that guarantees all rows with the same col1 value are written under the same output directory:
df.write.partitionBy("col1").mode("overwrite").parquet("/address/")
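A minimal end-to-end PySpark sketch of this suggestion, using the sample data from the question (num_partitions = 4 is a placeholder value):

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a1", "b2"), ("a1", "b3")],
    ["col1", "col2"])

num_partitions = 4  # placeholder

# The random "part" column only balances rows across tasks; the on-disk
# grouping comes from partitionBy("col1"), which writes every col1 value
# into its own directory, e.g. /address/col1=a1/.
df = df.withColumn("part", (rand() * num_partitions).cast("int"))
(df.repartition(num_partitions, "part")
   .write.partitionBy("col1")
   .mode("overwrite")
   .parquet("/address/"))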
| Col1 | Col2 | Col3 |
|------|------|------|
| m | n | o |
| m | q | e |
| a | b | r |
Let's say I have a pandas DataFrame as shown above. Notice that the Col1 values are the same for the 0th and 1st rows. Is there a way to find all the duplicate entries in the DataFrame based on Col1 only?
Additionally, I would also like to add another column, say is_duplicate, which would be True for all the duplicate instances of my DataFrame and False otherwise.
Note: I want to find the duplicates based only on the value in Col1; the other columns may or may not be duplicates and shouldn't be taken into consideration.
.duplicated() has exactly that functionality:
df['is_duplicate'] = df.duplicated('Col1')
I found it:
df["is_duplicate"] = df.Col1.duplicated(keep=False)
I have a system which accumulates batch data in a snapshot.
Each record in a batch contains a unique_id, a version, and multiple other columns.
Previously, whenever a unique_id arrived in a new batch with a version higher than the one present in the snapshot, the system would replace the entire record and rewrite it as a new record. This is essentially a merge of two dataframes based on the version.
For example:
Snapshot:
<Uid> | <Version> | <col1> | <col2>
-----------------------------------
A1    | 1         | ab     | cd
A2    | 1         | ef     | gh

New Batch:
<Uid> | <Version> | <col1>
--------------------------
A3    | 1         | gh
A1    | 2         | hh
Note that col2 is absent in the new batch.
After the merge it becomes:
<Uid> | <Version> | <col1> | <col2>
-----------------------------------
A3    | 1         | gh     | Null
A1    | 2         | hh     | Null
A2    | 1         | ef     | gh
The problem is that even though no data for col2 arrived for Uid A1, after the merge that column is replaced by a null value, so the older value of the column (cd) is lost.
Now I want to replace only the columns for which data has actually arrived, i.e. the expected output is:
<Uid> | <Version> | <col1> | <col2>
-----------------------------------
A3    | 1         | gh     | Null
A1    | 2         | hh     | cd
A2    | 1         | ef     | gh
Note that for unique_id A1 the col2 value is intact.
However, if the batch has the record for A1 as:
New Batch:
<Uid> | <Version> | <col1> | <col2>
-----------------------------------
A1    | 2         | hh     | uu
The output will be:
<Uid> | <Version> | <col1> | <col2>
-----------------------------------
A1    | 2         | hh     | uu
A2    | 1         | ef     | gh
Here the entire record of A1 is replaced.
In the current system I am using Spark and storing the data as Parquet. I can tweak the merge process to incorporate this change.
However, I would like to know whether this is an optimal way to store data for this use case.
I am evaluating HBase and Hive ORC, along with possible changes I could make to the merge process.
Any suggestion will be highly appreciated.
As far as I understand, you need to use a full outer join between the snapshot and the journal (delta) and then use coalesce, for instance:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, when}

def applyDeduplicatedJournal(snapshot: DataFrame, journal: DataFrame, joinColumnNames: Seq[String]): DataFrame = {
  // Equality on every key column
  val joinExpr = joinColumnNames
    .map(column => snapshot(column) === journal(column))
    .reduceLeft(_ && _)

  // True when the journal side of the outer join is all null, i.e. the record exists only in the snapshot
  val isThereNoJournalRecord = joinColumnNames
    .map(jCol => journal(jCol).isNull)
    .reduceLeft(_ && _)

  // Keep the snapshot value when there is no journal record; otherwise prefer the journal value, falling back to the snapshot value when it is null
  val selectClause = snapshot.columns
    .map(col => when(isThereNoJournalRecord, snapshot(col)).otherwise(coalesce(journal(col), snapshot(col))) as col)

  snapshot
    .join(journal, joinExpr, "full_outer")
    .select(selectClause: _*)
}
In this case you merge the snapshot with the journal, with a fallback to the snapshot value whenever the journal has a null value.
Hope it helps!
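For illustration, here is a minimal PySpark sketch of the same idea on the sample data above. This is my own translation, not the original Scala code, and it assumes the batch has been aligned to the snapshot schema (missing columns such as col2 arrive as nulls):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

snapshot = spark.createDataFrame(
    [("A1", 1, "ab", "cd"), ("A2", 1, "ef", "gh")],
    "uid string, version int, col1 string, col2 string")
journal = spark.createDataFrame(
    [("A3", 1, "gh", None), ("A1", 2, "hh", None)],
    "uid string, version int, col1 string, col2 string")

s, j = snapshot.alias("s"), journal.alias("j")

# Full outer join on the key; per column, prefer the journal value and fall
# back to the snapshot value when the journal column is null.
merged = s.join(j, F.col("s.uid") == F.col("j.uid"), "full_outer").select(
    *[F.when(F.col("j.uid").isNull(), F.col("s." + c))
       .otherwise(F.coalesce(F.col("j." + c), F.col("s." + c)))
       .alias(c)
      for c in snapshot.columns])

merged.show()
# Expected rows (order may vary):
# A3 | 1 | gh | null
# A1 | 2 | hh | cd
# A2 | 1 | ef | gh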
I have a very large amount of data, which I plan to store in Cassandra. I am new to Cassandra and am trying to find a data model that will work for me.
My data is various parameters for commodities gathered over irregular time intervals:
commodity_id | timestamp | param1 | param2
c1 | '2018-01-01' | 5 | 15
c1 | '2018-01-03' | 7 | 15
c1 | '2018-01-08' | 8 | 10
c2 | '2018-01-01' | 100 | 13
c2 | '2018-01-02' | 140 | 13
c2 | '2018-01-05' | 130 | 13
c2 | '2018-01-06' | 150 | 13
I need to query the database, and get commodity IDs by "percentage change" in the params.
Ex. Find out all commodities whose param2 increased by more than 50% between '2018-01-02' and '2018-01-06'
CREATE TABLE "commodity" (
    commodity_id text,
    timestamp date,
    param1 int,
    param2 int,
    PRIMARY KEY (commodity_id, timestamp)
)
You should be fine with this table. You can expect on the order of daysPerYear entries per commodity partition per year, which is reasonably small, so you don't need any artificial bucketing keys. Even if you have a large number of commodities, you won't run out of partitions, as the Murmur3 partitioner has a token range of -2^63 to +2^63-1, i.e. 18,446,744,073,709,551,616 possible values.
I would pull the data from Cassandra and calculate the percentage changes in the application layer.
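A hedged sketch of that app-layer approach with the DataStax Python driver (cassandra-driver), assuming the table above, a known list of commodity ids, a keyspace named market, and that both query dates have a reading; all of these are illustrative assumptions:

from datetime import date
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # assumed contact point
session = cluster.connect("market")  # assumed keyspace

def param2_on(cid, day):
    # Single-partition read: partition key and clustering key are both bound.
    row = session.execute(
        "SELECT param2 FROM commodity WHERE commodity_id = %s AND timestamp = %s",
        (cid, day)).one()
    return row.param2 if row else None

start, end = date(2018, 1, 2), date(2018, 1, 6)
commodity_ids = ["c1", "c2"]         # assumed to be known/enumerable

matches = []
for cid in commodity_ids:
    before, after = param2_on(cid, start), param2_on(cid, end)
    if before and after and (after - before) / before > 0.5:
        matches.append(cid)          # param2 increased by more than 50%

print(matches)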
I have a table in Google Sheets like this one:
-------------------
| A | B | C | D |
-------------------
1 |C1 |C2 |C3 |C4 |
2 | 1 | 2 | 1 | 2 |
3 | 2 | 3 | 4 | 3 |
4 | 5 | 7 | 1 | 6 |
-------------------
My goal is to find which two of the columns C1, C2, C3 are closest to C4,
by calculating the average difference between each column and column C4.
E.g. column C1 will have an average of abs(((1-2)+(2-3)+(5-6))/3) = 1,
which is abs(((A2-D2)+(A3-D3)+(A4-D4))/(number of rows)).
I'm using ARRAYFORMULA to get the average difference for one column and then I drag it horizontally so the As become Bs and so on:
=ArrayFormula({A1;abs(average( (checks if there is empty cell) ,$D2:$D-(A2:A) )))})
If I use it in cell Z1, Z1 will show 'C1' and Z2 will show the average difference for column C1,
but I'm not sure how to write a single nested formula that does it for all columns A:C at once, without having to drag it,
i.e. I'd type =FORMULA(...) in Z1 and a table would show up.
Thank you
Try the formula:
=QUERY(ARRAYFORMULA(ABS((ROW(A2:C)*COLUMN(A2:C))^0*D2:D26-A2:C26)),
"select avg(Col"&JOIN("), avg(Col",ArrayFormula(row(INDIRECT("A1:A"&COLUMNS(A2:C)))))&")")
Explanation
(ROW(A2:C)*COLUMN(A2:C))^0*D2:D26 -- expands column D (C4) to the width of A2:C so it can be compared with the other columns.
"select avg(Col"&JOIN("), avg(Col"... -- composes the query that takes the average of each column.
Note: in your formula, abs(average( must be replaced with average(abs( so that the abs function is applied first.
I need to compute quantiles on a numeric field in Spark after a group-by operation. Is there a way to apply approxPercentile on an aggregated list instead of a column?
E.g.
The Dataframe looks like
k1 | k2 | k3 | v1
a1 | b1 | c1 | 879
a2 | b2 | c2 | 769
a1 | b1 | c1 | 129
a2 | b2 | c2 | 323
I need to first run groupBy(k1, k2, k3) and collect_list(v1), and then compute quantiles [10th, 50th, ...] on the list of v1 values.
You can use percentile_approx in Spark SQL.
Assuming your data is in df, you can do:
df.registerTempTable("df_tmp")

val dfWithPercentiles = sqlContext.sql("""
  select k1, k2, k3,
         percentile_approx(v1, 0.05) as 5th,
         percentile_approx(v1, 0.50) as 50th,
         percentile_approx(v1, 0.95) as 95th
  from df_tmp
  group by k1, k2, k3""")
On your sample data, this gives:
+---+---+---+-----+-----+-----------------+
| k1| k2| k3| 5th| 50th| 95th|
+---+---+---+-----+-----+-----------------+
| a1| b1| c1|129.0|129.0|803.9999999999999|
| a2| b2| c2|323.0|323.0| 724.4|
+---+---+---+-----+-----+-----------------+
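If you'd rather stay in the DataFrame API than register a temp table, here is a sketch along the same lines in PySpark; it relies on percentile_approx accepting an array of percentiles, which recent Spark versions support:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a1", "b1", "c1", 879), ("a2", "b2", "c2", 769),
     ("a1", "b1", "c1", 129), ("a2", "b2", "c2", 323)],
    ["k1", "k2", "k3", "v1"])

# One aggregation returns all requested quantiles per group as an array,
# so there is no need to collect_list(v1) first.
quantiles = df.groupBy("k1", "k2", "k3").agg(
    F.expr("percentile_approx(v1, array(0.1, 0.5, 0.9))").alias("quantiles"))

quantiles.show(truncate=False)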