Spark ETL: Unique Identifier for Generated Entities - apache-spark

We have a requirement in Spark where every record coming from the feed is broken into a set of entities.
Example: {col1,col2,col3} => Resource, {col4,col5,col6} => Account, {col7,col8} => EntityX, etc.
Now I need a unique identifier generated in the ETL layer which can be persisted to the corresponding database table for each of the entities/tables mentioned above.
This unique identifier acts as a lookup value to identify each table's records and to generate a sequence in the DB.
The first approach was to use Redis keys, generated for every entity from the natural unique columns in the feed.
But this approach was not stable, as Redis used to crash during peak hours, and since Redis operates in single-threaded mode it becomes slow when I run too many ETL jobs in parallel.
My thought now is to use a cryptographic hash algorithm like SHA-256 rather than a 32-bit hash. With a 32-bit hash there is a real possibility of collisions between different values, whereas SHA-256 produces a 256-bit digest, so the range of hash values is 2^256 and the probability of a collision is negligible.
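For reference, a deterministic key of this sort can be derived with Spark's built-in sha2 and concat_ws functions over the natural-key columns; this is only a minimal sketch, and feedDF and the column names are purely illustrative:
import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Sketch: derive a deterministic 256-bit key for the Resource entity
// from its natural unique columns (col1..col3 from the example above).
// "||" is an arbitrary separator to avoid ambiguous concatenations.
val resourceWithKey = feedDF.withColumn(
  "resource_key",
  sha2(concat_ws("||", col("col1"), col("col2"), col("col3")), 256)
)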
But this second option has not been well received by many people.
What are the other options/solutions to create unique keys in the ETL layer which can be looked up in the DB for comparison?
Thanks in Advance,
Rajesh Giriayppa

With dataframes, you can use the monotonically_increasing_id function (named monotonicallyIncreasingId in older releases, where it is now deprecated), which "generates monotonically increasing 64-bit integers" (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.functions$). It can be used this way:
dataframe.withColumn("INDEX", functions.monotonically_increasing_id())
With RDDs, you can use zipWithIndex or zipWithUniqueId. The former generates a real index (ordered between 0 and N-1, N being the size of the RDD), while the latter generates unique long IDs with no further guarantees, which seems to be what you need (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD). Note that zipWithUniqueId does not even trigger a Spark job and is therefore almost free.
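For completeness, a small sketch of the two RDD variants; entityRDD below is just a stand-in for any RDD of entity values (assumed here to be an RDD[String]):
import org.apache.spark.rdd.RDD

// zipWithIndex: contiguous indices 0..N-1, but triggers a job to count partition sizes.
val withIndex: RDD[(String, Long)] = entityRDD.zipWithIndex()
// zipWithUniqueId: unique but non-contiguous longs, computed without triggering a job.
val withUniqueId: RDD[(String, Long)] = entityRDD.zipWithUniqueId()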

Thanks for the reply. I have tried this method, but it doesn't give me the correlation or surrogate primary key I need to search the database. Every time I run the ETL job the indexes/numbers are different for each record if my dataset count changes.
I need a unique ID to correlate with the DB record, one which matches exactly one record and is the same for that record every time in the DB.
Are there any good design patterns or practices to compare an ETL dataset row to a DB record using a unique ID?

This is a little late, but in case someone else is looking...
I ran into a similar requirement. As Oli mentioned previously, zipWithIndex will give sequential, zero-indexed IDs, which you can then map onto an offset. Note that there is a critical section, so a locking mechanism could be required, depending on the use case.
case class Resource(_1: String, _2: String, _3: String, id: Option[Long])
case class Account(_4: String, _5: String, _6: String, id: Option[Long])

val inDS = Seq(
  ("a1", "b1", "c1", "x1", "y1", "z1"),
  ("a2", "b2", "c2", "x2", "y2", "z2"),
  ("a3", "b3", "c3", "x3", "y3", "z3")).toDS()

val offset = 1001 // load actual offset from db

val withSeqIdsDS = inDS.map(x => (Resource(x._1, x._2, x._3, None), Account(x._4, x._5, x._6, None)))
  .rdd.zipWithIndex // map index from 0 to n-1
  .map(x => (
    x._1._1.copy(id = Option(offset + x._2 * 2)),
    x._1._2.copy(id = Option(offset + x._2 * 2 + 1))
  )).toDS()
// save new offset to db
withSeqIdsDS.show()
+---------------+---------------+
| _1| _2|
+---------------+---------------+
|[a1,b1,c1,1001]|[x1,y1,z1,1002]|
|[a2,b2,c2,1003]|[x2,y2,z2,1004]|
|[a3,b3,c3,1005]|[x3,y3,z3,1006]|
+---------------+---------------+
withSeqIdsDS.select("_1.*", "_2.*").show
+---+---+---+----+---+---+---+----+
| _1| _2| _3| id| _4| _5| _6| id|
+---+---+---+----+---+---+---+----+
| a1| b1| c1|1001| x1| y1| z1|1002|
| a2| b2| c2|1003| x2| y2| z2|1004|
| a3| b3| c3|1005| x3| y3| z3|1006|
+---+---+---+----+---+---+---+----+
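The offset round-trip hinted at in the comments above ("load actual offset from db" / "save new offset to db") could look roughly like this with plain JDBC; this is only a sketch, and jdbcUrl, user, password, recordCount and the id_offsets table are hypothetical. A SELECT ... FOR UPDATE covers the critical section when several jobs run in parallel:
import java.sql.DriverManager

// Sketch: reserve a block of IDs for this run inside one transaction.
val conn = DriverManager.getConnection(jdbcUrl, user, password)
conn.setAutoCommit(false)
try {
  // Lock the offset row so parallel jobs cannot hand out the same range.
  val rs = conn.prepareStatement(
    "SELECT next_offset FROM id_offsets WHERE entity = 'Resource' FOR UPDATE").executeQuery()
  rs.next()
  val offset = rs.getLong(1)

  val reserved = 2L * recordCount // two IDs per input row in the example above
  val upd = conn.prepareStatement(
    "UPDATE id_offsets SET next_offset = ? WHERE entity = 'Resource'")
  upd.setLong(1, offset + reserved)
  upd.executeUpdate()
  conn.commit()
} finally {
  conn.close()
}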

Related

Java Spark Dataset retrieve cell value

My Dataset looks like below; I want to fetch the value in the 1st row, 1st column (A1 in this case).
+-------+---+--------------+----------+
|account|ccy|count(account)|sum_amount|
+-------+---+--------------+----------+
| A1|USD| 2| 500.24|
| A2|SGD| 1| 200.24|
| A2|USD| 1| 300.36|
+-------+---+--------------+----------+
I can do this as below:
Dataset<Row> finalDS = dataset.groupBy("account", "ccy")
    .agg(count("account"), sum("amount").alias("sum_amount"))
    .orderBy("account", "ccy");

Object[] items = (Object[]) (finalDS.filter(functions.col("sum_amount")
    .equalTo(300.36))
    .collect());

String accountNo = (String) ((GenericRowWithSchema) items[0]).get(0);
2 questions :
Is there any other/more efficient way to do this? I am aware of DataFrame/JavaRDD queries.
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought that this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
Is there any other/more efficient way to do this? I am aware of DataFrame/JavaRDD queries.
You'd better use the Dataset.head function (javadocs) in order to avoid passing all the data to the driver process. This limits you to loading only the 1st row into driver RAM instead of the entire dataset. You can also consider using the take function to obtain the first N rows.
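For instance, in the Scala API (the Java calls are analogous), something along these lines would bring back just the one matching row; finalDS is assumed to be the aggregated dataset from the question:
import org.apache.spark.sql.functions.col

// Sketch: only the first matching row is transferred to the driver.
val first = finalDS
  .filter(col("sum_amount").equalTo(300.36))
  .head()
val accountNo = first.getString(0)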
Without the explicit cast to Object[] there is a compile-time failure; however, I would have thought that this would be an implicit cast. Why? I suspect it has something to do with Scala compilation.
It depends on how your dataset is typed. In the case of a Dataframe (which is Dataset[Row], proof), you'll get an Array[Row] on a call to collect. It's worth mentioning the signature of the collect function:
def collect(): Array[T] = withAction("collect", queryExecution)(collectFromPlan)
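Because T is only known generically, Array[T] is erased in the compiled signature, so on the Java side the result surfaces as a plain Object and needs the explicit cast; from Scala the same call comes back fully typed, roughly like this (a sketch reusing finalDS from the question):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Sketch: collect() is typed as Array[Row] in Scala, so no cast is needed.
val rows: Array[Row] = finalDS.filter(col("sum_amount").equalTo(300.36)).collect()
val accountNo: String = rows(0).getString(0)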

SPARK Combining Neighbouring Records in a text file

Very new to Spark.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to reading in Spark. The format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a Spark RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1, record2, record3]. Each object would contain the data from its RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to Spark.
Is there a sensible way to do this using Spark (and if so, what could that be, i.e. what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to do off Spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, collect_list}
import spark.implicits._ // assumes a SparkSession named spark

val df = Seq(
  ("RECORD","record1identifier"),
  ("SUBRECORD","value1"),
  ("SUBRECORD2","value2"),
  ("RECORD","record2identifier"),
  ("RECORD","record3identifier"),
  ("SUBRECORD","value3"),
  ("SUBRECORD","value4"),
  ("SUBRECORD","value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")

val win = Window.orderBy("id")

val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
  .withColumn("recid", sum($"newrec").over(win))
  .select($"recid", $"record", $"value")

val recs = recids.where($"record" === "RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))

recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet will first zipWithIndex to identify each row, in order, then add a boolean column that is true every time a "record" is identified, and false otherwise. We then cast that boolean to a long, and then do a running sum, which has the neat side-effect of essentially labeling every record and its sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+

Transforming one row into many rows using Amazon Glue

I'm trying to use Amazon Glue to turn one row into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe delimited text file that is 360GB, compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket. I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
I believe I can use Amazon Glue for this. But, this is my first time using Glue. I'm struggling to figure out a good way to do this. Some of the pySpark-extension transformations look promising (perhaps "Map" or "Relationalize"); see http://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-etl-scripts-pyspark-transforms.html.
So, my question is: What is a good way to do this in Glue?
Thanks.
AWS Glue does not have an appropriate built-in GlueTransform subclass to convert a single DynamicRecord into multiple records (as a usual MapReduce mapper can do), nor can you create such a transform yourself.
But there are two ways to solve your problem.
Option 1: Using Spark RDD API
Let's try to perform exactly what you need: map a single record to multiple ones. Because of the GlueTransform limitations we will have to dive deeper and use the Spark RDD API.
RDDs have a special flatMap method which allows producing multiple rows, which are then flattened. The code for your example will look something like this:
source_data = somehow_get_the_data_into_glue_dynamic_frame()
source_data_rdd = source_data.toDF().rdd
unpivoted_data_rdd = source_data_rdd.flatMap(
    lambda row: (
        (
            row.primary_key,
            getattr(row, f'{field}_name'),
            getattr(row, f'{field}_value'),
        )
        for field in properties_names  # the list of property base names, e.g. 'property1'
    ),
)
unpivoted_data = glue_ctx.create_dynamic_frame \
    .from_rdd(unpivoted_data_rdd, name='unpivoted')
Option 2: Map + Relationalize + Join
If you want to do the requested operation using only AWS Glue ETL API then here are my instructions:
First, map every single DynamicRecord from the source into a primary key and a list of objects:
mapped = Map.apply(
    source_data,
    lambda record:  # here we operate on DynamicRecords, not RDD Rows
        DynamicRecord(
            primary_key=record.primary_key,
            fields=[
                dict(
                    key=getattr(record, f'{field}_name'),
                    value=getattr(record, f'{field}_value'),
                )
                for field in properties_names
            ],
        )
)
Example input:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male | 1|is_new | 1
67890|is_male | 0|is_new | 0
Output:
primary_key|fields
12345|[{'key': 'is_male', 'value': 1}, {'key': 'is_new', 'value': 1}]
67890|[{'key': 'is_male', 'value': 0}, {'key': 'is_new', 'value': 0}]
Next, relationalize it: every list will be converted into multiple rows, and every nested object will be unnested (the Scala Glue ETL API docs have good examples and more detailed explanations than the Python docs).
relationalized_dfc = Relationalize.apply(
    mapped,
    staging_path='s3://tmp-bucket/tmp-dir/',  # choose any dir for temp files
)
The method returns a DynamicFrameCollection. In the case of a single array field it will contain two DynamicFrames: the first with primary_key and a foreign key into the second, which holds the flattened and unnested fields.
Output:
# table name: roottable
primary_key|fields
12345| 1
67890| 2
# table name: roottable.fields
id|index|val.key|val.value
1| 0|is_male| 1
1| 1|is_new | 1
2| 0|is_male| 0
2| 1|is_new | 0
The last logical step is to join these two DynamicFrames:
joined = Join.apply(
    frame1=relationalized_dfc['roottable'],
    keys1=['fields'],
    frame2=relationalized_dfc['roottable.fields'],
    keys2=['id'],
)
Output:
primary_key|fields|id|index|val.key|val.value
12345| 1| 1| 0|is_male| 1
12345| 1| 1| 1|is_new | 1
67890| 2| 2| 0|is_male| 0
67890| 2| 2| 1|is_new | 0
Now you just have to rename and select the desired fields.

Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe.
For example, I have:
>>> df.show()
+-----+----------+
|index| col1|
+-----+----------+
| 0.0|0.58734024|
| 1.0|0.67304325|
| 2.0|0.85154736|
| 3.0| 0.5449719|
+-----+----------+
If I choose to calculate these using "Window" functions, then I can do that like so:
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index| col1| diffs_col1|
+-----+----------+-----------+
| 0.0|0.58734024|0.085703015|
| 1.0|0.67304325| 0.17850411|
| 2.0|0.85154736|-0.30657548|
| 3.0| 0.5449719| null|
+-----+----------+-----------+
Question: I explicitly partitioned the dataframe into a single partition. What is the performance impact of this and, if there is one, why is that so and how could I avoid it? Because when I do not specify a partition, I get the following warning:
16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
In practice the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated over sequentially one by one.
The difference is only in the number of partitions created in total. Let's illustrate that with an example using a simple dataset with 10 partitions and 1000 records:
df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))
If you define a frame without a partition by clause
w_unpart = Window.orderBy(f.col("index").asc())
and use it with lag
df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)
there will be only one partition in total:
df_lag_unpart.rdd.glom().map(len).collect()
[1000]
Compare that to a frame definition with a dummy partitioning key (simplified a bit compared to your code):
w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())
which will use a number of partitions equal to spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)
df_lag_part.rdd.glom().count()
11
with only one non-empty partition:
df_lag_part.rdd.glom().filter(lambda x: x).count()
1
Unfortunately there is no universal solution which can be used to address this problem in PySpark. It is just an inherent mechanism of the implementation combined with the distributed processing model.
Since the index column is sequential, you can generate an artificial partitioning key with a fixed number of records per block:
rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))
df_with_block = df.withColumn(
    "block", (f.col("index") / rec_per_block).cast("int")
)
and use it to define frame specification:
w_with_block = Window.partitionBy("block").orderBy("index")
df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)
This will use expected number of partitions:
df_lag_with_block.rdd.glom().count()
11
with roughly uniform data distribution (we cannot avoid hash collisions):
df_lag_with_block.rdd.glom().map(len).collect()
[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]
but with a number of gaps on the block boundaries:
df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
12
Since boundaries are easy to compute:
from itertools import chain

boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers.
    # This could be generalized to any monotonically increasing
    # id by taking min and max per block.
    (idx - 1, idx) for idx in
    df_lag_with_block.groupBy("block").min("index")
        .drop("block").rdd.flatMap(lambda x: x)
        .collect()))[2:]  # The first boundary doesn't carry useful info.
you can always select:
missing = df_with_block.where(f.col("index").isin(boundary_idxs))
and fill these separately:
# We use a window without partitions here. Since the number of records
# will be small this won't be a performance issue,
# but it will generate the "Moving all data to a single partition" warning.
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))
and join:
combined = (df_lag_with_block
    .join(missing_with_lag, ["index"], "leftouter")
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))
to get the desired result:
mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
