A view in Spark is a powerful data layer: you can apply flexible logic on it, control data visibility, and organize different sources together.

After a lot of hands-on practice with views, it is worth summarizing the key points of advanced view usage. I ran multiple tests beforehand to make these notes as correct as possible.

Scope

A view can serve not only as an intermediate, temporary code block; it can also be created permanently, depending on user requirements.

-- a temporary view only takes effect in the current session
create [or replace] temporary view temp_v as select ...;

-- permanent view, visible across sessions
create [or replace] view p_cm_v.global_v as select ...;

-- drop a permanent view
drop view [if exists] p_cm_v.global_v;
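
A quick way to check what was created is SHOW VIEWS (available since Spark 3.0); its isTemporary column distinguishes the two scopes:

-- permanent views live under their database
show views in p_cm_v like 'global_v';

-- temporary views are session-scoped and flagged isTemporary = true
show views like 'temp_v';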

Column

A view, just like a table, has its own structure: a name and columns, each with a type and even an isNullable flag. The DDL below is used as an example to show how column names and types are generated, so that we know how to keep them unchanged when rebuilding the view.

create or replace view p_cm_v.global_v (
  name_1,
  age_1,
  sex_1,
  birthday_1
) as  
select name_2,
  age_2,
  sex_2,
  birthday_2
from table_a
union all
select name_3,
  age_3,
  sex_3,
  birthday_3 
from table_b;

Name

  1. The naming priority is positional: *_1 > *_2 > *_3 (this holds even when a name is not explicitly defined; a describe check follows below).
  2. If no name is defined at all, the generated name is the text of the SQL expression itself, for example:

col_name                             data_type  comment
CASE WHEN (1 = 1) THEN 1 ELSE 0 END  int        NULL
CAST(1 AS INT)                       int        NULL
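
To verify which names were resolved, describing the view should show the explicit column list winning (a sketch based on the DDL above):

describe p_cm_v.global_v;
-- expected col_name values: name_1, age_1, sex_1, birthday_1;
-- if the explicit list were omitted, the first branch's names (*_2) would be used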

Type

  1. Briefly, the column type is the common (wider) type of the corresponding *_2 and *_3 columns when such a combination is valid; otherwise an error is thrown (see the sketch below).
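
For instance, a minimal sketch (reusing the table names from the DDL above): int and bigint widen to bigint, while int and array<int> have no common type:

-- int + bigint -> the view column resolves to bigint
create or replace view v_ok as
select cast(1 as int) c from table_a
union all
select cast(2 as bigint) c from table_b;

-- int + array<int> -> no common type, the statement fails analysis
create or replace view v_bad as
select cast(1 as int) c from table_a
union all
select array(1) c from table_b;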

In detail:

In Spark 3.0, when inserting values into table columns with different data types, type casting is performed according to the ANSI SQL standard. Some unreasonable type conversions (such as string to int and double to boolean) are not allowed.

In Spark 2.4 and earlier, type conversions are allowed during table inserts as long as the type conversions are valid casts. When inserting an out-of-range value into an integer field, the low-order bits of the value are inserted (same as Java/Scala numeric type casts). For example, if 257 is inserted into a field of type Byte, the result is 1. This behavior is controlled by the option spark.sql.storeAssignmentPolicy, which defaults to “ANSI”. Setting this option to “Legacy” restores the previous behavior.

https://learn.microsoft.com/zh-cn/azure/databricks/release-notes/runtime/7.x-migration#sql-datasets-and-dataframe
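
A minimal sketch of the two policies on an out-of-range insert (byte_t is a hypothetical one-column table):

set spark.sql.storeAssignmentPolicy=LEGACY;
create table byte_t (b byte);
insert into byte_t values (257);  -- low-order bits kept: stores 1

set spark.sql.storeAssignmentPolicy=ANSI;
insert into byte_t values (257);  -- fails with an overflow error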

When spark.sql.storeAssignmentPolicy is set to ANSI (the default value since Spark 3.0), Spark SQL complies with the ANSI store assignment rules on table insertions. The valid combinations of source and target data types in table insertions are given by the following table.

Source\Target  Numeric  String  Date  Timestamp  Interval  Boolean  Binary  Array  Map  Struct
Numeric        Y        Y       N     N          N         N        N       N      N    N
String         N        Y       N     N          N         N        N       N      N    N
Date           N        Y       Y     Y          N         N        N       N      N    N
Timestamp      N        Y       Y     Y          N         N        N       N      N    N
Interval       N        Y       N     N          Y         N        N       N      N    N
Boolean        N        Y       N     N          N         Y        N       N      N    N
Binary         N        Y       N     N          N         N        Y       N      N    N
Array          N        N       N     N          N         N        N       Y*     N    N
Map            N        N       N     N          N         N        N       N      Y*   N
Struct         N        N       N     N          N         N        N       N      N    Y*

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-ansi-compliance#store-assignment
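
The table can be checked directly; a minimal sketch (t_demo is a hypothetical table):

set spark.sql.storeAssignmentPolicy=ANSI;
create table t_demo (c int);
insert into t_demo select cast(1 as tinyint);  -- numeric -> numeric: allowed, widened to int
insert into t_demo values ('1');               -- string -> numeric: rejected even for a numeric string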

isNullable

  1. The nullable flag can hardly be set in Spark SQL, but when you define a column from a hard-coded literal it is automatically marked not null (nullable = false).

create view ... as
select 1 age,
  map('name', name, 'country', country) info
from ...;

col_name  data_type                    comment
age       int not null                 null
info      map<string,string> not null  null

  2. In the Scala API it is an explicit, optional setting (the nullable flag of StructField).

Performance Tuning Case

A view is not the same as a table: the flexible logic it encapsulates can also bring performance issues when the view is queried. Wherever there is logic, there is room for optimization.

The initial version had a performance issue when queried on Spark 2.3, because the non-equi join condition (sent_dt between lkp.start_dt and lkp.end_dt) forced a Broadcast Nested Loop Join.

After analysis, the workaround is to join the lookup table with only a loose bound and move the exact range condition into the WHERE clause, filtering the data after the join.

Before:

create or replace view access_views.mh_event_ep as
select
  case
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.pa_msg_list)
    ) >= 1
    and size(
      array_intersect(split(canvas_msg_ids, ','), lkp.esp_msg_list)
    ) >= 1 then 'Nespresso'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.pa_msg_list)
    ) >= 1 then 'PA'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.esp_msg_list)
    ) >= 1 then 'Espresso'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.other_msg_list)
    ) >= 1 then 'Unknown'
  end email_type,
  a.*
from
  access_views.mh_event a
  join prs_restricted_v.msg_type_lkp lkp on sent_dt between lkp.start_dt
  and lkp.end_dt
where
  actn_dt between date_sub(current_date, 30)
  and current_date -- partition key
  and sent_dt between date_sub(current_date, 30)
  and current_date; -- business requested

After:

create or replace view access_views.mh_event_ep as
select
  case
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.pa_msg_list)
    ) >= 1
    and size(
      array_intersect(split(canvas_msg_ids, ','), lkp.esp_msg_list)
    ) >= 1 then 'Nespresso'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.pa_msg_list)
    ) >= 1 then 'PA'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.esp_msg_list)
    ) >= 1 then 'Espresso'
    when size(
      array_intersect(split(canvas_msg_ids, ','), lkp.other_msg_list)
    ) >= 1 then 'Unknown'
  end email_type,
  a.*
from
  access_views.mh_event a
  join prs_restricted_v.msg_type_lkp lkp on 1 = 1
  and lkp.end_dt >= date_sub(current_date, 30)
where
  actn_dt between date_sub(current_date, 30)
  and current_date -- partition key
  and sent_dt between date_sub(current_date, 30)
  and current_date -- business requested
  and sent_dt between lkp.start_dt
  and lkp.end_dt;
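
To confirm the change, a hedged way to compare the two versions is to inspect the physical plan:

explain
select * from access_views.mh_event_ep limit 10;
-- the initial version shows BroadcastNestedLoopJoin in the physical plan;
-- recreate the view and rerun to compare the plans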