I'm trying to use AWS Glue to turn one row into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe delimited text file that is 360GB, compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1
There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket. I need to get the data into Redshift, but the column limit in Redshift is 1,600.
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
I believe I can use AWS Glue for this, but this is my first time using Glue and I'm struggling to figure out a good way to do it. Some of the PySpark extension transforms look promising (perhaps "Map" or "Relationalize"); see http://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-etl-scripts-pyspark-transforms.html.
So, my question is: What is a good way to do this in Glue?
Thanks.
AWS Glue does not have a built-in GlueTransform subclass that converts a single DynamicRecord into multiple records (as a typical MapReduce mapper can). You cannot create such a transform yourself either.
But there are two ways to solve your problem.
Option 1: Using Spark RDD API
Let's do exactly what you need: map a single record to multiple records. Because of the GlueTransform limitations, we have to go a level deeper and use the Spark RDD API.
RDD has a flatMap method that lets each input Row produce multiple output rows, which are then flattened into a single RDD. The code for your example will look something like this:
source_data = somehow_get_the_data_into_glue_dynamic_frame()
source_data_rdd = source_data.toDF().rdd
# properties_names is the list of column prefixes, e.g. ['property1', ..., 'property800']
unpivoted_data_rdd = source_data_rdd.flatMap(
    lambda row: (
        (
            row.primary_key,
            getattr(row, f'{field}_name'),
            getattr(row, f'{field}_value'),
        )
        for field in properties_names
    ),
)
unpivoted_data = glue_ctx.create_dynamic_frame \
    .from_rdd(unpivoted_data_rdd, name='unpivoted')
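To make the flatMap semantics concrete without running a Glue job, here is a minimal plain-Python sketch of the same per-row logic. The unpivot_row helper, the two-element properties_names list and the sample dict are illustrative assumptions, not part of the Glue or Spark API; itertools.chain.from_iterable stands in for the "map, then flatten" behaviour of flatMap.

```python
from itertools import chain

# Hypothetical column prefixes; the real file would have property1 .. property800.
properties_names = ["property1", "property800"]

def unpivot_row(row):
    """Yield one (primary_key, key, value) tuple per property pair of a wide row."""
    for field in properties_names:
        yield (row["primary_key"], row[f"{field}_name"], row[f"{field}_value"])

# One wide record, mirroring the sample line from the question.
rows = [{
    "primary_key": "12345",
    "property1_name": "is_male", "property1_value": "1",
    "property800_name": "is_college_educated", "property800_value": "1",
}]

# flatMap = map each record to several records, then flatten the results.
unpivoted = list(chain.from_iterable(unpivot_row(r) for r in rows))
print(unpivoted)
```

Each wide row yields as many narrow rows as there are property pairs, so the output row count is roughly the input row count times the number of properties.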
Option 2: Map + Relationalize + Join
If you want to do the requested operation using only AWS Glue ETL API then here are my instructions:
First map every single DynamicRecord from source into primary key and list of objects:
mapped = Map.apply(
    source_data,
    lambda record:  # here we operate on DynamicRecords, not RDD Rows
        DynamicRecord(
            primary_key=record.primary_key,
            fields=[
                dict(
                    key=getattr(record, f'{field}_name'),
                    value=getattr(record, f'{field}_value'),
                )
                for field in properties_names
            ],
        )
)
Example input:
primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male | 1|is_new | 1
67890|is_male | 0|is_new | 0
Output:
primary_key|fields
12345|[{'key': 'is_male', 'value': 1}, {'key': 'is_new', 'value': 1}]
67890|[{'key': 'is_male', 'value': 0}, {'key': 'is_new', 'value': 0}]
Next, relationalize it: every list is converted into multiple rows, and every nested object is unnested (the Scala Glue ETL API docs have good examples and more detailed explanations than the Python docs).
relationalized_dfc = Relationalize.apply(
    mapped,
    staging_path='s3://tmp-bucket/tmp-dir/',  # choose any dir for temp files
)
The method returns a DynamicFrameCollection. With a single array field it contains two DynamicFrames: the first holds primary_key plus a foreign key into the second, which holds the flattened and unnested fields.
Output:
# table name: roottable
primary_key|fields
12345| 1
67890| 2
# table name: roottable.fields
id|index|val.key|val.value
1| 0|is_male| 1
1| 1|is_new | 1
2| 0|is_male| 0
2| 1|is_new | 0
The last logical step is to join these two DynamicFrame's:
joined = Join.apply(
    frame1=relationalized_dfc['roottable'],
    keys1=['fields'],
    frame2=relationalized_dfc['roottable.fields'],
    keys2=['id'],
)
Output:
primary_key|fields|id|index|val.key|val.value
12345| 1| 1| 0|is_male| 1
12345| 1| 1| 1|is_new | 1
67890| 2| 2| 0|is_male| 0
67890| 2| 2| 1|is_new | 0
Now you just have to rename and select the desired fields.
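The relationalize, join and rename/select steps can be traced end to end on the two sample records with plain Python data structures. This is only a sketch of the data flow under the table layouts shown above; it does not call the Glue API, and the variable names (root_table, child_table, pk_by_fk) are made up for illustration.

```python
# Output of the Map step: one record per primary key with a list of objects.
mapped = [
    {"primary_key": 12345, "fields": [{"key": "is_male", "value": 1},
                                      {"key": "is_new", "value": 1}]},
    {"primary_key": 67890, "fields": [{"key": "is_male", "value": 0},
                                      {"key": "is_new", "value": 0}]},
]

# "Relationalize": split into a root table and a child table linked by a foreign key.
root_table, child_table = [], []
for fk, record in enumerate(mapped, start=1):
    root_table.append({"primary_key": record["primary_key"], "fields": fk})
    for index, item in enumerate(record["fields"]):
        child_table.append({"id": fk, "index": index,
                            "val.key": item["key"], "val.value": item["value"]})

# "Join" on root.fields == child.id, then rename and keep only the wanted columns.
pk_by_fk = {r["fields"]: r["primary_key"] for r in root_table}
result = [
    {"primary_key": pk_by_fk[c["id"]], "key": c["val.key"], "value": c["val.value"]}
    for c in child_table
]
for row in result:
    print(row)
```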
I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
Which is something I can work with. However I can't do the same with the includes column as it has the {} enclosing the [].
Is there a way for me to deal with this using PySpark?
EDIT:
If you were to look at the includes sections in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally in the first table in my question, I'd want the includes to be users, or at least be able to drill down to users
As your includes column is a MapType with key value = "users" (or, as spark.read.json usually infers for JSON objects, a struct with a users field), you can use .getItem() to get the array by that key, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
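The order of operations matters here: first step into the users key, then explode the array. The following plain-Python sketch shows the same two steps on a heavily shortened, hypothetical payload shaped like the file in the question; it uses only the json module and is not PySpark code.

```python
import json

# Hypothetical, shortened payload with the same nesting as the question's file.
payload = json.loads("""
{
  "data": [{"id": "1562543391161741312", "author_id": "961778216070344705"}],
  "includes": {"users": [{"id": "893899303", "name": "a"},
                         {"id": "42", "name": "b"}]},
  "meta": {}
}
""")

# includes is an object, so select its "users" key first ...
users = payload["includes"]["users"]

# ... and only then "explode": one output row per array element.
user_rows = [dict(user) for user in users]
for row in user_rows:
    print(row)
```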
I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried. It works only when the numbers are contiguous; could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
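Here is another way, using built-in functions only (array_join and the filter higher-order function require Spark 2.4 or later); please check the performance.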
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
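The split / filter / join chain in that expression can be checked quickly outside Spark. Below is a plain-Python sketch of the same idea; numbers_from is a hypothetical helper, and the all-digits regex only approximates Spark's CAST(x AS INT) IS NOT NULL test (it does not handle signs, whitespace or integer overflow).

```python
import re

def numbers_from(col1, col2=None):
    """Return col2 if it is set, else the all-digit tokens of col1 joined by commas."""
    if col2 is not None:          # coalesce(col2, ...)
        return col2
    tokens = col1.split("_")      # split(col1, '_')
    digits = [t for t in tokens if re.fullmatch(r"[0-9]+", t)]  # filter(...)
    return ",".join(digits)       # array_join(..., ',')

samples = [
    ("This_is_111_222_333_test", None),
    ("This_is_111_222_444_test", "3296"),
    ("This_is_555_and_666_test", None),
    ("This_is_999_test", None),
]
for col1, col2 in samples:
    print(col1, "->", numbers_from(col1, col2))
```

Because each token is tested on its own, the numbers do not need to be contiguous, which is the case the regexp_extract attempt in the question missed.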
The sample RDD looks like:
(key1,(111,222,1))
(key1,(113,224,1))
(key1,(114,225,0))
(key1,(115,226,0))
(key1,(113,226,0))
(key1,(116,227,1))
(key1,(117,228,1))
(key2,(118,229,1))
I am currently working on a Spark project. For each key, I want to keep the first and last elements whose third tuple value is '1', and likewise the first and last elements whose third tuple value is '0'.
Is it possible to do this with reduceByKey? After some research I could not find a good way to get what I want. I want the result in the same order as the expected output shown below.
Expected output:
(key1,(111,222,1))
(key1,(114,225,0))
(key1,(113,226,0))
(key1,(116,227,1))
(key2,(118,229,1))
Much appreciated.
If I understand correctly, you want the first "1", the first "0", the last "1" and the last "0" for each key, and maintain the order. If I were you, I would use the SparkSQL API to do that.
First, let's build your RDD (by the way, providing sample data is very nice; giving enough code so that we can reproduce what you did is even better):
val seq = Seq(("key1",(111,222,1)),
("key1",(113,224,1)),
("key1",(114,225,0)),
("key1",(115,226,0)),
("key1",(113,226,0)),
("key1",(116,227,1)),
("key1",(117,228,1)),
("key2",(118,229,1)))
val rdd = sc.parallelize(seq)
// then I switch to dataframes, and add an id to be able to go back to
// the previous order
val df = rdd.toDF("key", "value").withColumn("id", monotonicallyIncreasingId)
df.show()
+----+-----------+------------+
| key| value| id|
+----+-----------+------------+
|key1|[111,222,1]| 8589934592|
|key1|[113,224,1]| 25769803776|
|key1|[114,225,0]| 42949672960|
|key1|[115,226,0]| 60129542144|
|key1|[113,226,0]| 77309411328|
|key1|[116,227,1]| 94489280512|
|key1|[117,228,1]|111669149696|
|key2|[118,229,1]|128849018880|
+----+-----------+------------+
Now, we can group by "key" and "value._3", keep the min(id) and its max and explode back the data. With a window however, we can do it in a simpler way. Let's define the following window:
val win = Window.partitionBy("key", "value._3").orderBy("id")
// now we compute the previous and next element of each id using resp. lag and lead
val big_df = df
.withColumn("lag", lag('id, 1) over win)
.withColumn("lead", lead('id, 1) over win)
big_df.show
+----+-----------+------------+-----------+------------+
| key| value| id| lag| lead|
+----+-----------+------------+-----------+------------+
|key1|[111,222,1]| 8589934592| null| 25769803776|
|key1|[113,224,1]| 25769803776| 8589934592| 94489280512|
|key1|[116,227,1]| 94489280512|25769803776|111669149696|
|key1|[117,228,1]|111669149696|94489280512| null|
|key1|[114,225,0]| 42949672960| null| 60129542144|
|key1|[115,226,0]| 60129542144|42949672960| 77309411328|
|key1|[113,226,0]| 77309411328|60129542144| null|
|key2|[118,229,1]|128849018880| null| null|
+----+-----------+------------+-----------+------------+
Now we see that the rows you are after are the ones with either a lag equal to null (first element) or a lead equal to null (last element). Therefore, let's filter, sort back to the previous order using the id and select the columns you need:
val result = big_df
.where(('lag isNull) || ('lead isNull))
.orderBy('id)
.select("key", "value")
result.show
+----+-----------+
| key| value|
+----+-----------+
|key1|[111,222,1]|
|key1|[114,225,0]|
|key1|[113,226,0]|
|key1|[117,228,1]|
|key2|[118,229,1]|
+----+-----------+
Finally, if you really need a RDD, you can convert the dataframe with:
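// the struct column comes back as a Row, so rebuild the tuple field by field
result.rdd.map(row => row.getString(0) -> { val v = row.getStruct(1); (v.getInt(0), v.getInt(1), v.getInt(2)) })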
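The selection rule (keep the first and last element of each (key, flag) group, then restore the original order) can be verified on the sample data with a few lines of plain Python. This is a single-machine sketch of the logic only, not Spark code; the list position plays the role of the id column, and first_idx / last_idx are illustrative names.

```python
records = [
    ("key1", (111, 222, 1)),
    ("key1", (113, 224, 1)),
    ("key1", (114, 225, 0)),
    ("key1", (115, 226, 0)),
    ("key1", (113, 226, 0)),
    ("key1", (116, 227, 1)),
    ("key1", (117, 228, 1)),
    ("key2", (118, 229, 1)),
]

first_idx, last_idx = {}, {}
for i, (key, value) in enumerate(records):
    group = (key, value[2])          # same grouping as the window partition
    first_idx.setdefault(group, i)   # only the first position is kept
    last_idx[group] = i              # overwritten until the last position

# Union of first and last positions, sorted to restore the input order.
keep = sorted(set(first_idx.values()) | set(last_idx.values()))
result = [records[i] for i in keep]
for r in result:
    print(r)
```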
We have a requirement in Spark where every record coming from the feed is broken into a set of entities.
Example: {col1,col2,col3} => Resource, {col4,col5,col6} => Account, {col7,col8} => EntityX, etc.
Now I need a unique identifier generated in the ETL layer which can be persisted to the database table for each of the above-mentioned tables/entities.
This unique identifier acts as a lookup value to identify each table's records and generate a sequence in the DB.
The first approach was to use Redis to generate keys for every entity, identified using the natural unique columns in the feed.
But this approach was not stable: the Redis instance used to crash during peak hours, and since Redis operates in single-threaded mode it would be slow when I run too many ETL jobs in parallel.
My thought is to use a cryptographic hash such as SHA-256 rather than a 32-bit hash. With only 32 bits there is a real possibility of hash collisions for different values, whereas SHA-256 produces a 256-bit digest (2^256 possible values), so the probability of a collision is negligible.
But the second option is not well accepted by many people.
What are the other options/solutions to create unique keys in the ETL layer which can be looked up in the DB for comparison?
Thanks in Advance,
Rajesh Giriayppa
With dataframes, you can use the monotonicallyIncreasingId function (named monotonically_increasing_id since Spark 2.0) that "generates monotonically increasing 64-bit integers" (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.sql.functions$). It can be used this way:
dataframe.withColumn("INDEX", functions.monotonicallyIncreasingId())
With RDDs, you can use zipWithIndex or zipWithUniqueId. The former generates a real index (ordered between 0 and N-1, N being the size of the RDD) while the latter generates unique long IDs, without further guarantees which seems to be what you need (https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD). Note that zipWithUniqueId does not even trigger a spark job and is therefore almost free.
Thanks for the reply. I have tried this method, but it doesn't give me a correlation or surrogate primary key to search the database with. Every time I run the ETL job, the indexes or numbers will be different for each record if my dataset count changes.
I need a unique ID to correlate with the DB record; it should match only one record, and it should be the same record in the DB every time.
Are there any good design patterns or practices for comparing an ETL dataset row to a DB record by unique ID?
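For reference, the deterministic-hash idea from the question (an ID derived only from the natural key columns, so the same record gets the same ID on every run) could look roughly like the sketch below. It is an illustration, not a recommendation from this thread: surrogate_key is a hypothetical helper, and the "\x1f" unit separator is assumed never to occur in the data.

```python
import hashlib

def surrogate_key(entity, *natural_key_parts):
    """Derive a stable hex ID from the entity name and its natural key columns."""
    # The separator keeps ("ab", "c") from hashing the same as ("a", "bc");
    # it is assumed not to appear inside any column value.
    payload = "\x1f".join([entity, *map(str, natural_key_parts)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same inputs give the same ID regardless of run, row order or dataset size.
resource_id = surrogate_key("Resource", "a1", "b1", "c1")
account_id = surrogate_key("Account", "x1", "y1", "z1")
print(resource_id)
print(account_id)
```

The trade-off is a 64-character hex key instead of a compact integer, and any change to a natural key column produces a different ID.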
This is a little late, but in case someone else is looking...
I ran into a similar requirement. As Oli mentioned previously, zipWithIndex will give sequential, zero-indexed id's, which you can then map onto an offset. Note, there is a critical section, so a locking mechanism could be required, depending on use case.
case class Resource(_1: String, _2: String, _3: String, id: Option[Long])
case class Account(_4: String, _5: String, _6: String, id: Option[Long])
val inDS = Seq(
("a1", "b1", "c1", "x1", "y1", "z1"),
("a2", "b2", "c2", "x2", "y2", "z2"),
("a3", "b3", "c3", "x3", "y3", "z3")).toDS()
val offset = 1001 // load actual offset from db
val withSeqIdsDS = inDS.map(x => (Resource(x._1, x._2, x._3, None), Account(x._4, x._5, x._6, None)))
.rdd.zipWithIndex // map index from 0 to n-1
.map(x => (
x._1._1.copy(id = Option(offset + x._2 * 2)),
x._1._2.copy(id = Option(offset + x._2 * 2 + 1))
)).toDS()
// save new offset to db
withSeqIdsDS.show()
+---------------+---------------+
| _1| _2|
+---------------+---------------+
|[a1,b1,c1,1001]|[x1,y1,z1,1002]|
|[a2,b2,c2,1003]|[x2,y2,z2,1004]|
|[a3,b3,c3,1005]|[x3,y3,z3,1006]|
+---------------+---------------+
withSeqIdsDS.select("_1.*", "_2.*").show
+---+---+---+----+---+---+---+----+
| _1| _2| _3| id| _4| _5| _6| id|
+---+---+---+----+---+---+---+----+
| a1| b1| c1|1001| x1| y1| z1|1002|
| a2| b2| c2|1003| x2| y2| z2|1004|
| a3| b3| c3|1005| x3| y3| z3|1006|
+---+---+---+----+---+---+---+----+
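The ID arithmetic is easy to check in isolation: with two entities per input row, row i receives offset + 2*i and offset + 2*i + 1, and the next free offset is offset + 2*n. A plain-Python sketch of just that arithmetic, where enumerate stands in for zipWithIndex and next_offset is an illustrative name for the value that would be saved back to the DB:

```python
offset = 1001  # would be loaded from the database in a real job

rows = [
    ("a1", "b1", "c1", "x1", "y1", "z1"),
    ("a2", "b2", "c2", "x2", "y2", "z2"),
    ("a3", "b3", "c3", "x3", "y3", "z3"),
]

# Two entities per row, so each row consumes two consecutive IDs.
with_ids = [
    ((r[0], r[1], r[2], offset + i * 2),       # Resource
     (r[3], r[4], r[5], offset + i * 2 + 1))   # Account
    for i, r in enumerate(rows)
]

# First unused ID after this batch; this is what has to be persisted.
next_offset = offset + 2 * len(rows)
print(with_ids)
print(next_offset)
```

Reading the offset, assigning the IDs and writing next_offset back form the critical section mentioned above if several jobs can run at once.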
I have a DataFrame(df) in pyspark, by reading from a hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'.
How do I create a dictionary like the following, which can be iterated over later on?
{'person1': ['google','msn','yahoo']}
{'person2': ['fb.com','airbnb','wired.com']}
{'person3': ['fb.com','google.com']}
Appreciate your thoughts and help.
I think you can try row.asDict(). This code runs directly on the executors, so you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
How about using the pyspark Row.asDict() method? This is part of the dataframe API (which I understand is the "recommended" API at time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google','msn','yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com','airbnb','wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect() [1] to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn,yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn,yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
[1] Be advised that for large data sets, this operation can be slow and potentially fail with an Out of Memory error. You should consider if this is what you really want to do first as you will lose the parallelization benefits of spark by bringing the data into local memory.
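If a single dictionary keyed by name is more convenient to iterate over than a list of one-entry dicts, the collected rows can be folded into one mapping. In the sketch below, plain (name, urls) tuples stand in for the collected rows; it is not tied to a Spark session, and visits_by_person is an illustrative name. It also assumes the names are unique, since later duplicates would overwrite earlier ones.

```python
# Stand-in for the collected rows: one (Name, URL visited) pair per row.
collected = [
    ("person1", ["google", "msn", "yahoo"]),
    ("person2", ["fb.com", "airbnb", "wired.com"]),
    ("person3", ["fb.com", "google.com"]),
]

# One dictionary for the whole result: name -> list of visited URLs.
visits_by_person = {name: urls for name, urls in collected}

for name, urls in visits_by_person.items():
    print(name, urls)
```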
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)