Storing Data Using A Local Data Warehouse

A data warehouse serves as a centralized hub designed for the storage, management, and analysis of large datasets. This system integrates data from diverse sources, such as transactional systems, relational databases, and various data streams, into one comprehensive and cohesive database.


Local Data Warehouse In Databricks

Accessing The Unity Catalog & Legacy Hive Metastore

The Unity Catalog is a unified, fine-grained data governance solution that streamlines the security and management of data by offering a centralized platform for administering and auditing data access. New workspaces are automatically enabled for Unity Catalog and attached to a metastore upon creation.

Data is organized into metastores, catalogs, and schemas, and accessed much like a traditional data warehouse, with support for the SQL, Python, Scala, and R languages. Tables are classified as managed (internal) or external, depending on where the underlying data is actually stored.
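In practice, every table lives in a three-level namespace of the form catalog.schema.table, and the managed/external distinction comes down to whether a storage LOCATION is supplied. A minimal sketch, assuming a hypothetical sales catalog and monthly schema; the storage path is a placeholder:

-- Managed table: Unity Catalog owns and manages the underlying storage --
CREATE TABLE sales.monthly.orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
)

-- External table: the data remains at a location managed outside Unity Catalog --
CREATE TABLE sales.monthly.orders_ext (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
)
LOCATION "abfss://<container>@<storage-account>.dfs.core.windows.net/orders"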

Older workspaces may still be associated with legacy Hive metastores, which can operate alongside the Unity Catalog. However, Databricks strongly advises migrating all legacy data to take advantage of the enhanced security and governance features that Unity Catalog introduces.
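The legacy metastore surfaces in such workspaces as a catalog named hive_metastore, so both can be queried side by side. A short sketch, assuming a hypothetical customers table and a Unity Catalog named main:

-- Legacy Hive metastore table --
SELECT * FROM hive_metastore.default.customers

-- Same table after migration to Unity Catalog --
SELECT * FROM main.default.customers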


Local Data Warehouse In Synapse

Accessing The Data Lake Storage Account

When a Synapse workspace is created, it is linked to a new or existing data lake storage account. This data lake serves as the workspace's primary storage account for executing basic queries and commands. Additional storage accounts can be connected via linked services and accessed using Spark or SQL pools.
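As a quick illustration, the built-in serverless SQL pool can query files in the primary data lake directly with OPENROWSET. A minimal sketch; the container, account, and file path are placeholders:

-- Query a Parquet file in the primary data lake --
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'abfss://<container>@<storage-account>.dfs.core.windows.net/<path>/orders.parquet',
    FORMAT = 'PARQUET'
) AS orders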

Migration Of Data From Databricks To Synapse

Managed Or Internal Data

Internal (managed) data from Databricks can be transferred to Synapse using a Copy activity within an Azure Data Factory pipeline, but this method can be unreliable. A more dependable alternative is to create a foreign catalog and run federated queries against the data, as sketched below.
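Foreign catalogs and federated queries are part of Databricks Lakehouse Federation. A minimal sketch of the syntax, assuming a SQL Server-compatible Synapse endpoint; the connection name, catalog name, endpoint, credentials, and table are all placeholders:

-- Define a connection to the external SQL endpoint --
CREATE CONNECTION synapse_connection TYPE sqlserver
OPTIONS (
  host "<synapse-sql-endpoint>",
  port "1433",
  user "<server-username>",
  password "<server-password>"
)

-- Expose the remote database as a foreign catalog --
CREATE FOREIGN CATALOG synapse_catalog
USING CONNECTION synapse_connection
OPTIONS (database "<database-name>")

-- Tables in the remote database can now be queried directly --
SELECT * FROM synapse_catalog.dbo.orders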

External Data

External data from Databricks can be connected to Synapse directly through its original data source. Delta tables can be accessed via their original data lake storage location, while SQL tables can be accessed through a JDBC connection string.

-- Delta Lake Storage --
-- Register the Delta table by pointing at its ABFS path --
CREATE TABLE <table-name>
USING DELTA
LOCATION "abfss://<container>@<storage-account>.dfs.core.windows.net/<table-path>"


-- SQL Storage --
-- Register the remote SQL table through Spark's JDBC data source --
-- (note: the credential options are named "user" and "password") --
CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://<database-server-name>:<port-number>;database=<database-name>",
  dbtable "<table-name>",
  user "<server-username>",
  password "<server-password>"
)

Key Differences

External Hive Metastores In Synapse

By default, Synapse does not leverage the capabilities of an external Apache Hive metastore. However, Apache Spark pools can be configured to use a shared external Hive metastore as their catalog, allowing catalog objects to be shared with computational engines outside the workspace, such as Databricks. Configuring the metastore involves linking an Azure SQL Database or Azure Database for MySQL instance to the workspace and adjusting settings at the Spark pool or session level.
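At the Spark pool or session level, this typically comes down to a handful of properties. A sketch, assuming Hive metastore version 2.3 and a hypothetical linked service named HiveMetastoreLS pointing at the metastore database:

-- Spark Pool / Session Configuration --
spark.sql.hive.metastore.version 2.3
spark.hadoop.hive.synapse.externalmetastore.linkedservice.name HiveMetastoreLS
spark.sql.hive.metastore.jars /opt/hive-metastore/lib-2.3/*:/usr/hdp/current/hadoop-client/lib/*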

Absence Of Global Sessions In Synapse

Due to the lack of global sessions in Synapse, an external storage account cannot be mounted directly to the Synapse file system for use across multiple sessions. Unlike Databricks, which can hold both managed and external data within the same file system, Synapse stores internal data only in its primary data lake and reaches external data in other storage accounts through linked services.
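One consequence in the serverless SQL pool is that an external account is registered per database as an external data source and then referenced per query, rather than mounted once for the whole workspace. A minimal sketch; the data source name, container, account, and path are placeholders:

-- Register the external account, then reference it per query --
CREATE EXTERNAL DATA SOURCE external_lake
WITH (
    LOCATION = 'abfss://<container>@<external-account>.dfs.core.windows.net'
)

SELECT TOP 10 *
FROM OPENROWSET(
    BULK '<path>/orders.parquet',
    DATA_SOURCE = 'external_lake',
    FORMAT = 'PARQUET'
) AS orders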

