Databases
Default location of database dbfs:/user/hive/warehouse/<database_name>.db/
The folder that will contain the objects will have the name of the database and end with the .db
extension
JSON Data
The nested values of JSON data can be accessed using the :
operator
We can use the from_json()
and schema_of_json()
function to define a schema
Once an schema is applied on the data we can access nested data using the .
operator
Once a schema has been applied to JSON data the *
operator can be used to flatten the data into different columns
CREATE OR REPLACE TEMP VIEW new_events_final AS
SELECT json.* FROM parsed_events;
Array Functions
The explode
function can be used to put each record of the array as a new row
The collect_set
function can collect unique values for a field, including fields within arrays (NULL values are excluded)
The flatten
function allows multiple arrays to be combined into a single array
The array_distinct
function removes duplicate elements from an array
SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
TRANSFORM: Allows to apply a Higher Order Function to each element in a array
Pivot Table
Pivot Table allows us to convert values in a column into columns
Pivot in SQL - Databricks
CREATE OR REPLACE TABLE transactions AS
SELECT * FROM (
SELECT
email,
order_id,
transaction_timestamp,
total_item_quantity,
purchase_revenue_in_usd,
unique_items,
item.item_id AS item_id,
item.quantity AS quantity
FROM sales_enriched
) PIVOT (
sum(quantity) FOR item_id in (
'P_FOAM_K',
'M_STAN_Q',
'P_FOAM_S',
'M_PREM_Q',
'M_STAN_F',
'M_STAN_T',
'M_PREM_K',
'M_PREM_F',
'M_STAN_K',
'M_PREM_T',
'P_DOWN_S',
'P_DOWN_K'
)
);