Storing Data Using a Local Data Warehouse
A data warehouse is a centralized repository for storing, managing, and analyzing large datasets. It integrates data from diverse sources, such as transactional systems, relational databases, and data streams, into a single, cohesive database.
Local Data Warehouse In Databricks
Accessing The Unity Catalog & Legacy Hive Metastore
The Unity Catalog is a unified, fine-grained data governance solution that streamlines the security and management of data by offering a centralized platform for administering and auditing data access. It is automatically enabled and linked to a workspace upon creation.
Data is organized hierarchically into metastores, catalogs, and schemas, and accessed much as in a traditional data warehouse, with support for SQL, Python, Scala, and R. Tables are classified as managed (internal) or external depending on where the underlying data is actually stored.
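As a minimal sketch of this hierarchy (the catalog, schema, and table names below are placeholders, and the external path assumes an external location has already been granted in Unity Catalog):
-- Managed (internal) table: Unity Catalog stores the data in its default managed location --
CREATE CATALOG IF NOT EXISTS demo_catalog;
CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema;
CREATE TABLE demo_catalog.demo_schema.orders (order_id INT, amount DOUBLE);
-- External table: only metadata is registered; the data stays at the given path --
CREATE TABLE demo_catalog.demo_schema.orders_ext (order_id INT, amount DOUBLE)
LOCATION "abfss://<container>@<storage-account>.dfs.core.windows.net/orders";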
Local Data Warehouse In Synapse
Accessing The Data Lake Storage Account
When a Synapse workspace is created, it is linked to an existing Data Lake Storage account, which serves as the workspace's primary storage for executing basic queries and commands. Additional external storage accounts can be connected via linked services and accessed using Spark or SQL pools.
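For illustration, a serverless SQL pool can query files in the primary data lake directly with OPENROWSET; the storage account, container, and path below are placeholders:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;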
Migration Of Data From Databricks To Synapse
Managed Or Internal Data
Managed (internal) data in Databricks can be transferred to Synapse using a Copy activity within Azure Data Factory pipelines, but this method is not fully reliable. A more dependable alternative is to create a foreign catalog and run federated queries, so the data can be accessed directly from Synapse.
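As a sketch of what the foreign-catalog approach can look like on the Databricks side (using Lakehouse Federation; the connection name, host, and credentials below are placeholders for a SQL Server-compatible endpoint):
-- Define a connection to the remote SQL endpoint --
CREATE CONNECTION synapse_conn TYPE sqlserver
OPTIONS (
  host '<server-name>.sql.azuresynapse.net',
  port '1433',
  user '<server-username>',
  password '<server-password>'
);
-- Expose the remote database as a foreign catalog for federated queries --
CREATE FOREIGN CATALOG synapse_catalog
USING CONNECTION synapse_conn
OPTIONS (database '<database-name>');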
External Data
External data in Databricks can be connected to Synapse directly through its original data source: Delta tables can be accessed via their underlying data lake storage location, while SQL tables can be accessed using a JDBC connection string, as shown below.
-- Delta Lake Storage --
CREATE TABLE <table-name>
USING DELTA
LOCATION "abfss://<container>@<storage-account>.dfs.core.windows.net/<table-path>";
-- SQL Storage (Spark JDBC data source) --
CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:sqlserver://<database-server-name>:<port-number>;database=<database-name>",
  dbtable "<table-name>",
  user "<server-username>",  -- Spark's JDBC source expects "user", not "username"
  password "<server-password>"
);
Key Differences
External Hive Metastores In Synapse
By default, a Synapse workspace uses its own built-in metastore rather than an external Apache Hive metastore. It does, however, allow Apache Spark pools to use a shared, managed Hive metastore as their catalog, so that catalog objects can be shared with computational engines outside the workspace, such as Databricks. Configuring the metastore involves linking an Azure SQL Database or Azure Database for MySQL to the workspace and adjusting settings at the Spark pool or session level.
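As an illustrative sketch, the settings can be supplied as Spark pool or session configuration. The property names follow the pattern Synapse documents for external metastores, but the linked service name, metastore version, and jar path here are assumptions to verify against the target environment:
-- Example Spark pool configuration (key/value properties, not SQL) --
spark.sql.hive.metastore.version 3.1
spark.hadoop.hive.synapse.externalmetastore.linkedservice.name <linked-service-name>
spark.sql.hive.metastore.jars /opt/hive-metastore/lib-3.1/*:/usr/hdp/current/hadoop-client/lib/*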
Absence Of Global Sessions In Synapse
Because Synapse has no global sessions, an external storage account cannot be mounted to the Synapse file system for use across multiple sessions. Unlike Databricks, which can expose both managed and external data storage within the same file system, Synapse stores internal data only in its primary data lake and reaches external data in other storage accounts through linked services.