Spark number of partitions logic for Join operator - apache-spark

I would like to understand how Spark calculates the number of partitions when joining data.
I am using Spark 1.6.2 with YARN and Hadoop.
I have this code:
val df1 = .....
val df2 = .... .cache() // small cached dataframe
// cartesian join
val joined = df1.join(broadcast(df2)).persist(StorageLevel.MEMORY_AND_DISK_SER)
println(df1.rdd.partitions.size)    // prints 10
println(df2.rdd.partitions.size)    // prints 28
println(joined.rdd.partitions.size) // prints 33
Can someone explain why the result is 33?
EDIT
== Optimized Logical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- Join Inner, None
      :- InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- BroadcastHint
         +- InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None

== Physical Plan ==
Project [key1#6L,key2#9,key3#21L,temp_index#33L,CASE WHEN (key1_type#2 = business) THEN ((rand#179 * 2000.0) + 10000.0) ELSE ((rand#179 * 2000.0) + 5000.0) AS amount#180]
+- Project [key1_type#2,key1#6L,key2#9,key3#21L,temp_index#33L,randn(-5800712378829663042) AS rand#179]
   +- BroadcastNestedLoopJoin BuildRight, Inner, None
      :- InMemoryColumnarTableScan [key1_type#2,key1#6L,key2#9,key3#21L], InMemoryRelation [key1_type#2,key1#6L,key2#9,key3#21L], true, 10000, StorageLevel(true, true, false, false, 1), BroadcastNestedLoopJoin BuildRight, Inner, None, None
      +- InMemoryColumnarTableScan [temp_index#33L], InMemoryRelation [temp_index#33L], true, 10000, StorageLevel(true, true, false, true, 1), Project [id#32L AS temp_index#33L], None
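For reference, this is how I inspect where the partitioning comes from; a minimal sketch against the Spark 1.6 API, assuming the df1/df2/joined values from above (the repartition call is only there to show that the count can be pinned explicitly, at the cost of an extra shuffle):
// Print the parsed/analyzed/optimized/physical plans; the physical plan shows the
// BroadcastNestedLoopJoin that produces the joined DataFrame.
joined.explain(true)
// The executed plan also exposes the operator tree backing joined.rdd.
println(joined.queryExecution.executedPlan)
// If a specific partition count is needed downstream, it can be set explicitly
// (this adds a shuffle on top of the join result).
val joined20 = joined.repartition(20)
println(joined20.rdd.partitions.size) // prints 20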

Related

write delta lake in Databricks error: HttpRequest 409 err PathAlreadyExist

Sometimes I get this error when a job in Databricks is writing to Azure Data Lake:
HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=f448-0832-41ac-a2ab-8821453ef3c8,rid=7d4-101f-005a-578c-f82000000,connMs=0,sendMs=0,recvMs=38,sent=0,recv=168,method=PUT,url=https://awutmp.dfs.core.windows.net/bronze/app/_delta_log/_last_checkpoint?resource=file&timeout=90
My code reads from Blob Storage using Auto Loader and writes to Azure Data Lake:
Schemas:
val binarySchema = StructType(List(
  StructField("path", StringType, true),
  StructField("modificationTime", TimestampType, true),
  StructField("length", LongType, true),
  StructField("content", BinaryType, true)
))
val jsonSchema = StructType(List(
  StructField("EquipmentId", StringType, true),
  StructField("EquipmentName", StringType, true),
  StructField("EquipmentType", StringType, true),
  StructField("Name", StringType, true),
  StructField("Value", StringType, true),
  StructField("ValueType", StringType, true),
  StructField("LastSourceTimeStamp", StringType, true),
  StructField("LastReprocessDate", StringType, true),
  StructField("LastStateDuration", StringType, true),
  StructField("MessageId", StringType, true)
))
Create delta table if not exists:
val sinkPath = "abfss://bronze@awutmp.dfs.core.windows.net/app"
val tableSQL =
  s"""
  CREATE TABLE IF NOT EXISTS bronze.awutmpapp(
    path STRING,
    file_modification_time TIMESTAMP,
    file_length LONG,
    value STRING,
    json struct<EquipmentId STRING, EquipmentName STRING, EquipmentType STRING, Name STRING, Value STRING, ValueType STRING, LastSourceTimeStamp STRING, LastReprocessDate STRING, LastStateDuration STRING, MessageId STRING>,
    job_name STRING,
    job_version STRING,
    schema STRING,
    schema_version STRING,
    timestamp_etl_process TIMESTAMP,
    year INT GENERATED ALWAYS AS (YEAR(file_modification_time)) COMMENT 'generated from file_modification_time',
    month INT GENERATED ALWAYS AS (MONTH(file_modification_time)) COMMENT 'generated from file_modification_time',
    day INT GENERATED ALWAYS AS (DAY(file_modification_time)) COMMENT 'generated from file_modification_time'
  )
  USING DELTA
  PARTITIONED BY (year, month, day)
  LOCATION '${sinkPath}'
  """
spark.sql(tableSQL)
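To double-check that the table and its generated partition columns were registered as expected, something like the following can be run after the CREATE TABLE (a minimal sketch; DESCRIBE DETAIL is Delta-specific and the table name is the one created above):
spark.sql("DESCRIBE DETAIL bronze.awutmpapp").show(truncate = false) // location, format, partitionColumns
spark.table("bronze.awutmpapp").printSchema()                        // should include the year/month/day generated columns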
Options:
val options = Map[String, String](
  "cloudFiles.format" -> "BinaryFile",
  "cloudFiles.useNotifications" -> "true",
  "cloudFiles.queueName" -> queue,
  "cloudFiles.connectionString" -> queueConnString,
  "cloudFiles.validateOptions" -> "true",
  "cloudFiles.allowOverwrites" -> "true",
  "cloudFiles.includeExistingFiles" -> "true",
  "recursiveFileLookup" -> "true",
  "modifiedAfter" -> "2022-01-01T00:00:00.000+0000",
  "pathGlobFilter" -> "*.json.gz",
  "ignoreCorruptFiles" -> "true",
  "ignoreMissingFiles" -> "true"
)
Method to process each micro-batch:
def decompress(compressed: Array[Byte]): Option[String] =
  Try {
    val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
    scala.io.Source.fromInputStream(inputStream).mkString
  }.toOption

def binaryToStringUDF: UserDefinedFunction = {
  udf { (data: Array[Byte]) => decompress(data).orNull }
}

def processMicroBatch: (DataFrame, Long) => Unit = (df: DataFrame, id: Long) => {
  val resultDF = df
    .withColumn("content_string", binaryToStringUDF(col("content")))
    .withColumn("array_value", split(col("content_string"), "\n"))
    .withColumn("array_noempty_values", expr("filter(array_value, value -> value <> '')"))
    .withColumn("value", explode(col("array_noempty_values")))
    .withColumn("json", from_json(col("value"), jsonSchema))
    .withColumnRenamed("length", "file_length")
    .withColumnRenamed("modificationTime", "file_modification_time")
    .withColumn("job_name", lit("jobName"))
    .withColumn("job_version", lit("1.0"))
    .withColumn("schema", lit(schema.toString))
    .withColumn("schema_version", lit("1.0"))
    .withColumn("timestamp_etl_process", current_timestamp())
    .withColumn("timestamp_tz", expr("current_timezone()"))
    .withColumn("timestamp_etl_process",
      to_utc_timestamp(col("timestamp_etl_process"), col("timestamp_tz")))
    .drop("timestamp_tz", "array_value", "array_noempty_values", "content", "content_string")

  resultDF
    .write
    .format("delta")
    .mode("append")
    .option("path", sinkPath)
    .save()
}
val storagePath = "wasbs://signal@externalaccount.blob.core.windows.net/"
val checkpointPath = "/checkpoint/signal/autoloader"

spark
  .readStream
  .format("cloudFiles")
  .options(options)
  .schema(binarySchema)
  .load(storagePath)
  .writeStream
  .format("delta")
  .outputMode("append")
  .foreachBatch(processMicroBatch)
  .option("checkpointLocation", checkpointPath)
  .trigger(Trigger.AvailableNow)
  .start()
  .awaitTermination()
This is additional information I have seen in Azure Log Analytics.
How can I solve this error?
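A minimal diagnostic sketch (not a fix) that can be run from a Databricks notebook to see what already exists at the path the failing PUT points to, assuming dbutils is available and sinkPath is the variable defined above:
// List the Delta log of the sink table to see whether _last_checkpoint is already
// there and how large the recent checkpoint files are.
val logPath = s"$sinkPath/_delta_log"
dbutils.fs.ls(logPath)
  .filter(f => f.name.startsWith("_last_checkpoint") || f.name.endsWith(".checkpoint.parquet"))
  .foreach(f => println(s"${f.name} size=${f.size}"))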

Python iterate over all possible combinations of boolean variables

I have 6 boolean variables in a dict and I want to run my code on all of their possible combinations.
So I have:
params["is_A"] = True/False
params["is_B"] = True/False
...
and then for all possible combinations, I want to call
my_func(params)
What is the best way to do so?
itertools.product can generate all the combinations:
import itertools

names = 'is_A is_B is_C is_D is_E is_F'.split()

def my_func(params):
    print(params)

for p in itertools.product([True, False], repeat=6):
    params = dict(zip(names, p))
    my_func(params)
Output:
{'is_A': True, 'is_B': True, 'is_C': True, 'is_D': True, 'is_E': True, 'is_F': True}
{'is_A': True, 'is_B': True, 'is_C': True, 'is_D': True, 'is_E': True, 'is_F': False}
...
{'is_A': False, 'is_B': False, 'is_C': False, 'is_D': False, 'is_E': False, 'is_F': True}
{'is_A': False, 'is_B': False, 'is_C': False, 'is_D': False, 'is_E': False, 'is_F': False}

What Prettier/ESLint setting is this

Prettier/ESLint is taking this line:
if (response.status === 404 || (contentType != null && contentType.indexOf('javascript') === -1)) {
and making it into this:
if (
    response.status === 404 ||
    (contentType != null &&
        contentType.indexOf('javascript') === -1)
) {
I'm using Webstorm, but the command line output is this:
$ eslint src
C:\Users\jtuzman\dev\...\front_end\src\serviceWorker.js
109:40 error Replace `·contentType.indexOf('javascript')·===·-1)` with `␍⏎····················contentType.indexOf('javascript')·===·-1)␍⏎············` prettier/prettier
✖ 1 problem (1 error, 0 warnings)
1 error and 0 warnings potentially fixable with the `--fix` option.
I've done some customizing of the prettier config (including explicitly writing out defaults):
module.exports = {
    tabWidth: 4,
    tabs: false,
    semi: true,
    singleQuote: true,
    quoteProps: 'as-needed',
    trailingComma: 'none',
    bracketSpacing: true,
    jsxBracketSameLine: true,
    arrowParens: 'avoid'
};
But I can't tell what rule this is about. Perhaps it's an eslint rule. Anyone have ideas?

Convert HQL to SparkSQL

I'm trying to convert HQL to Spark SQL.
I have the following query (it works in Hue with the Hive editor):
select reflect('java.util.UUID', 'randomUUID') as id,
tt.employee,
cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date,
collect_set(tt.employee_detail) as employee_details,
collect_set( tt.emp_indication ) as employees_indications,
named_struct ('employee_info', collect_set(tt.emp_info),
'employee_mod_info', collect_set(tt.emp_mod_info),
'employee_comments', collect_set(tt.emp_comment) )
as emp_mod_details,
from (
select views_ctr.employee,
if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail,
if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info,
if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment,
if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info,
if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication,
from
( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr
) tt
group by employee
distribute by employee
First, I'm trying to write it in spark.sql as follows:
sparkSession.sql("select reflect('java.util.UUID', 'randomUUID') as id, tt.employee, cast( from_unixtime(unix_timestamp (date_format(current_date(),'dd/MM/yyyy HH:mm:ss'), 'dd/MM/yyyy HH:mm:ss')) as timestamp) as insert_date, collect_set(tt.employee_detail) as employee_details, collect_set( tt.emp_indication ) as employees_indications, named_struct ('employee_info', collect_set(tt.emp_info), 'employee_mod_info', collect_set(tt.emp_mod_info), 'employee_comments', collect_set(tt.emp_comment) ) as emp_mod_details, from ( select views_ctr.employee, if ( views_ctr.employee_details.so is not null, views_ctr.employee_details, null ) employee_detail, if ( views_ctr.employee_info.so is not null, views_ctr.employee_info, null ) emp_info, if ( views_ctr.employee_comments.so is not null, views_ctr.employee_comments, null ) emp_comment, if ( views_ctr.employee_mod_info.so is not null, views_ctr.employee_mod_info, null ) emp_mod_info, if ( views_ctr.emp_indications.so is not null, views_ctr.emp_indications, null ) employees_indication, from ( select * from views_sta where emp_partition=0 and employee is not null ) views_ctr ) tt group by employee distribute by employee")
But I got the following exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
- object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@30cfd641)
If I run my query without the collect_set function it works. Could it fail because of the struct column types in my table?
How can I write my HQL query in Spark / fix my exception?
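One way to narrow down which expression triggers the exception is to register the inner sub-select as a temporary view and re-introduce the collect_set / named_struct aggregations one at a time. A hedged sketch, not a verified fix, assuming sparkSession is a Spark 2.x SparkSession and views_sta is available in the metastore:
// Materialise the inner sub-select once.
val viewsCtr = sparkSession.sql(
  "select * from views_sta where emp_partition = 0 and employee is not null")
viewsCtr.createOrReplaceTempView("views_ctr")

// Start from the plain grouping, then add the struct aggregations back one by one
// to see which column type makes the job fail.
val partial = sparkSession.sql(
  """select employee,
    |       collect_set(employee_details) as employee_details
    |from views_ctr
    |group by employee""".stripMargin)

partial.explain(true) // inspect the physical plan before triggering an action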

How to read data from HBase table using pyspark?

I have created a dummy HBase table called emp having one record. Below is the data.
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.1540 seconds
hbase(main):006:0> scan 'emp'
ROW                      COLUMN+CELL
 1                       column=personal data:name, timestamp=1512478562674, value=raju
1 row(s) in 0.0280 seconds
Now I have established a connection between HBase and PySpark using shc. Can you please help me with the code to read the above HBase table as a DataFrame in PySpark?
Version Details:
Spark Version 2.2.0, HBase 1.3.1, HCatalog 2.3.1
You can try it like this:
pyspark --master local --packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf.cloudera.hbase/hbase-site.xml
empdata = ''.join("""
{
    'table': {
        'namespace': 'default',
        'name': 'emp'
    },
    'rowkey': 'key',
    'columns': {
        'emp_id': {'cf': 'rowkey', 'col': 'key', 'type': 'string'},
        'emp_name': {'cf': 'personal data', 'col': 'name', 'type': 'string'}
    }
}
""".split())

df = sqlContext \
    .read \
    .options(catalog=empdata) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.show()
Refer to this blog for more info: https://diogoalexandrefranco.github.io/interacting-with-hbase-from-pyspark/
