Microsoft Fabric Internals Deep Dive

Introduction

Microsoft Fabric offers a cohesive Software as a Service (SaaS) solution, encompassing the essential functionalities for analytics across Power BI, Microsoft Azure Data Factory, and the upcoming iteration of Synapse. Fabric consolidates data warehousing, data engineering, data science, data integration, applied observability (through the Data Activator experience), real-time analytics, and business intelligence within a unified architecture.

We aim to give our customers a clear and concise explanation of Microsoft Fabric internals, so that they understand how it differs from other data warehouses and what role it plays in enterprise digitalization.

Key Differentiators and Importance

Unified Experience

Unlike traditional siloed solutions, Fabric provides a seamless end-to-end experience for data management, analytics, and observability.

Efficiency

By consolidating services, Fabric streamlines workflows, reduces complexity, and accelerates time-to-insights.

Scalability

Fabric scales effortlessly to handle enterprise-scale data volumes and diverse workloads.

Strategic Impact

As organizations embrace digital transformation, Fabric becomes a strategic enabler for data-driven decision-making, innovation, and growth.

Microsoft Fabric isn’t just another data warehouse—it’s a holistic ecosystem that empowers enterprises to harness their data effectively and drive meaningful outcomes. 

Microsoft Fabric Services

Let’s take a deep dive into the fascinating world of Unified Data Management

1. Lakehouse Delta Lake Format with V-Order Compression and Versioning
  • The concept of a lakehouse combines the best of data lakes and data warehouses. It allows organizations to store vast amounts of raw data while also providing a warehouse’s structure and query capabilities.
  • Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It ensures data consistency, reliability, and performance.
  • V-Order compression is a write-time optimization for the Parquet format that applies sorting, dictionary encoding, and compression, improving storage efficiency and read performance.
2. Bin Compaction for Improved Performance
  • Bin compaction optimizes storage by grouping data into bins or segments. It reduces fragmentation and enhances query performance (a brief notebook sketch follows this list).
3. Virtual Warehouses and Serverless Pools
  • Virtual warehouses are scalable, on-demand computing resources for running queries against data stored in cloud data warehouses.
  • Serverless pools provide automatic scaling based on workload demand, allowing efficient resource utilization without manual provisioning.
4. Integrated Services
  • A unified approach integrates various data services, such as data cataloging, lineage tracking, and governance, into a cohesive platform.
5. BI Reporting and Data Science in One Platform
  • Having business intelligence (BI) reporting and data science tools within the same platform streamlines analytics workflows and promotes collaboration.
6. Notebooks for Data Pipelines and Data Science
  • Notebooks provide a collaborative environment for developing and documenting data pipelines, experiments, and insights.
7. Zero-Copy Table Cloning
  • Tables can be cloned without physically duplicating the underlying data, which reduces redundancy and simplifies data management.
8. Data Engineering Services
  • These services encompass tasks related to data ingestion, transformation, and preparation.
9. Shortcuts with APIs for Custom Code
  • Developers can create custom shortcuts using APIs, enhancing productivity and flexibility.
10. OneSecurity and Data Governance
  • Ensuring data security and governance across the entire data lifecycle is critical for compliance and risk management.
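
To make items 1 and 2 above a little more concrete, here is a minimal sketch of how a Fabric Spark notebook might write a Delta table and then compact it. The configuration key, paths, and table names are assumptions for illustration; check the current Fabric documentation for the exact settings in your runtime.

```python
# Minimal sketch (PySpark in a Fabric notebook) - assumes a lakehouse is attached.
# The config name, paths, and table names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# V-Order is a write-time optimization for Parquet/Delta files in Fabric.
# The exact session setting may differ by runtime version (assumption).
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Write raw events as a Delta table in the attached lakehouse.
events = spark.read.json("Files/raw/events/")          # hypothetical input path
events.write.format("delta").mode("append").saveAsTable("events_bronze")

# Bin compaction: OPTIMIZE rewrites many small files into fewer, larger ones,
# reducing fragmentation and improving scan performance.
spark.sql("OPTIMIZE events_bronze")

# Optional housekeeping: remove files no longer referenced by the Delta log.
spark.sql("VACUUM events_bronze RETAIN 168 HOURS")
```

Here OPTIMIZE performs the bin compaction described above, while VACUUM clears files the Delta log no longer references.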

Microsoft Fabric Data Warehouse is an intriguing cloud-native data warehousing solution that harnesses the power of the Polaris distributed SQL query engine.

Let’s get into the details

1. Polaris Engine
  • Stateless and Interactive: Polaris stands as a stateless, interactive relational query engine that drives the Fabric Data Warehouse. It’s designed to seamlessly unify data warehousing and big data workloads while segregating compute and state.
  • Optimized for Analytics: Polaris is a distributed analytics system, meticulously optimized for analytical workloads. It operates as a columnar, in-memory engine, ensuring high efficiency and robust concurrency handling.
  • Cell Abstraction: Polaris represents data using a unique “cell” abstraction with two dimensions:
    • Distributions: Aligns data efficiently.
    • Partitions: Enables data pruning.
  • Cell Awareness: Polaris elevates the optimizer framework in SQL Server by introducing cell awareness. Each cell holds its own statistics, vital for the Query Optimizer (QO). This empowers the QO to implement diverse execution strategies and sophisticated estimation techniques, unlocking its full potential.
2. Fabric Data Warehouse Features
  • Delta Lake Format: Fabric Warehouse persists data in Delta Lake format, ensuring reliability and transactional consistency.
  • Separation of State and Compute: By decoupling state and compute, Fabric Warehouse achieves enhanced resource scalability and flexible scaling.
  • Fine-Grained Orchestration: Task inputs are defined in terms of cells, allowing for fine-grained orchestration using state machines.
  • Cloud-Native and Scalable: Polaris, being cloud-native, supports both big data and relational warehouse workloads. Its stateless architecture provides the flexibility and scalability needed for modern data platforms.
Key points: Fabric Lakehouse and Delta Lake


Let’s break down the components of the above architecture diagram

1. Delta Lake
  • Delta Lake is an optimized storage layer that serves as the cornerstone for storing data and tables in a lakehouse architecture. It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.
  • It ensures data reliability, consistency, and quality within your lakehouse architecture.
2. V-Order
  • V-Order is a write-time optimization applied to the Parquet file format. It significantly enhances read performance under Microsoft Fabric compute engines (such as SQL, Spark, and Power BI).
  • By sorting, distributing row groups, using dictionary encoding, and applying compression, V-Order reduces network, disk, and CPU resources during reads, resulting in cost efficiency and improved performance.
  • It has about a 15% impact on average write times but provides up to 50% more compression. Importantly, V-Ordered files remain fully compliant with the open-source Parquet format, ensuring compatibility.
3. Unified Data Lakehouse
  • The goal is a unified architecture that combines the best of both data lakes and data warehouses.
  • Here is how it can be structured:
    • Bronze Zone: Raw, unprocessed data lands here.
    • Silver Zone: Data is cleaned, transformed, and enriched.
    • Gold Zone: Aggregated, curated data for business intelligence (BI) and machine learning (ML)/AI purposes.
    • Data Warehouse: The gold zone serves as your data warehouse, providing a trusted source for BI queries.
    • Fabric Copilot: Fabric Copilot assists users working with data across all zones.
4. Integration
  • Delta Lake, V-Order, and the medallion zones work together within OneLake, so the same data serves engineering, BI, and data science workloads.

In summary, this architecture combines Delta Lake, V-Order, and a unified data lakehouse to achieve trusted data quality for ML/AI, BI, and analytics.

Fabric, as a unified SaaS experience, integrates Data Observability within its architecture.

Here are the key points

  • Unified Platform: Fabric combines capabilities for analytics across Microsoft Azure Data Factory, Power BI, and the next-gen Synapse.
  • Comprehensive Offerings: Fabric provides Data Governance, Data Security, Data Integration, Data Engineering, Data Warehousing, Data Science, Real-time Analytics, Applied Observability (via Data Activator), and Business Intelligence.

The Data Activator experience within the Fabric ecosystem is a novel module designed for real-time data detection and monitoring.

Let’s explore its key features; a small conceptual sketch of the detect-and-alert pattern follows the list below.

1. Real-Time Data Detection
  • The Data Activator continuously scans incoming data streams, identifying patterns, anomalies, and events in real time.
  • It leverages machine learning algorithms and statistical techniques to detect changes, spikes, or deviations from expected behavior.
  • Whether it’s sudden spikes in website traffic, unexpected sensor readings, or unusual transaction patterns, the Data Activator raises alerts promptly.
2. Monitoring and Alerting
  • Once detected, the Data Activator triggers alerts or notifications to relevant stakeholders.
  • These alerts can be customized based on severity levels, thresholds, and specific conditions.
  • Monitoring dashboards provide real-time visibility into data health, allowing data engineers and analysts to take immediate action.
3. Adaptive Learning
  • The Data Activator learns from historical data and adapts its detection algorithms over time.
  • As new data arrives, it refines its models, ensuring accurate and relevant alerts.
  • Adaptive learning helps reduce false positives and enhances the system’s responsiveness.
4. Integration with Fabric Components
  • The Data Activator seamlessly integrates with other Fabric components, such as data pipelines, data lakes, and analytics workflows.
  • It complements existing observability features, enhancing the overall data management experience.
  • By providing real-time insights, it empowers organizations to proactively address data quality, compliance, and operational challenges.
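
Data Activator itself is configured through the Fabric experience rather than through code, but the detect-then-alert pattern it automates can be pictured with a small, purely conceptual Python sketch. Nothing below uses a Data Activator API; the threshold, the stream, and the notify function are illustrative assumptions.

```python
# Toy illustration of the detect-then-alert pattern described above.
# This is NOT Data Activator code; it only mirrors the idea of watching a
# stream, comparing values to a condition, and raising an alert.
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Reading:
    sensor_id: str
    value: float

ALERT_THRESHOLD = 90.0  # assumed severity threshold

def notify(message: str) -> None:
    # Placeholder for an email/Teams notification or a downstream action.
    print(f"ALERT: {message}")

def monitor(stream: Iterable[Reading]) -> None:
    for reading in stream:
        # Data Activator evaluates rules like this continuously, across many
        # streams, without custom code and with adaptive thresholds.
        if reading.value > ALERT_THRESHOLD:
            notify(f"{reading.sensor_id} exceeded {ALERT_THRESHOLD} "
                   f"with value {reading.value}")

monitor([Reading("sensor-1", 72.0), Reading("sensor-2", 95.5)])
```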
Microsoft Fabric Analytics

Conclusion

A unified data management approach combines data quality, observability, cataloging, governance, and lineage. It centralizes and automates data workflows, enabling organizations to harness the full potential of their data and analytics investments, and plays a key role in enterprise digitalization with AI.

Microsoft Fabric Data Warehouse, powered by the Polaris engine, seamlessly bridges the gap between data warehousing and big data, all while embracing cloud-native principles.

The Data Activator experience is a crucial part of Fabric’s commitment to data observability, ensuring that data anomalies and issues are swiftly detected and addressed.

Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine

Introduction

Microsoft Fabric Data Warehouse is a cloud-native data warehousing solution that leverages the Polaris distributed SQL query engine.

Polaris is a stateless, interactive relational query engine that powers the Fabric Data Warehouse.

  1. It is designed to unify data warehousing and big data workloads while segregating compute and state for seamless cloud-native operation.
  2. It is a distributed analytics system built from the ground up to serve the needs of today’s data platforms and optimized for analytical workloads.
  3. It is a columnar, in-memory engine that is highly efficient and handles concurrency well.

MS Fabric: the data platform

Fabric Warehouse – Polaris Analytics Engine

The decoupling of compute and storage in Synapse Dedicated SQL Pool was an important step, but Microsoft Fabric Data Warehouse goes further by also decoupling compute and state, which allows for enhanced resource scalability and flexible resource scaling.

In stateful architectures, the state for inflight transactions remains stored in the compute node until the transaction commits, rather than being immediately hardened into persistent storage. As a consequence, in the event of a compute node failure, the state of non-committed transactions becomes lost, leaving no recourse but to terminate in-flight transactions. In summary, stateful architectures inherently lack the capability for resilience to compute node failure and elastic assignment of data to compute resources. 

However, decoupling of compute and storage is not the same as decoupling compute and state. In stateless compute architectures, compute nodes are designed to be devoid of any state information, meaning all data, transactional logs, and metadata must reside externally. This approach enables the application to partially restart query execution in case of compute node failures and smoothly adapt to real-time changes in cluster topology without disrupting in-flight transactions. 

MS Fabric - evolution of data warehouse

The evolution of data warehouse architectures over the years.

Data abstraction

Polaris represents data using a “cell” abstraction with two dimensions:
  • Distributions (data alignment)
  • Partitions (data pruning)
MS Fabric - Data abstraction

Polaris significantly elevates the optimizer framework in SQL Server by introducing cell awareness, where each cell holds its own statistics, vital for the Query Optimizer (QO). The QO, benefiting from Polaris’ cell awareness, implements a wide array of execution strategies and sophisticated estimation techniques, unlocking its full potential. In Polaris, a dataset is represented as a logical collection of cells, offering the flexibility to distribute them across compute nodes to achieve seamless parallelism. 

To achieve effective distribution across compute nodes, Polaris employs distributions that map cells to compute nodes and hash datasets across numerous buckets. This intelligent distribution enables the deployment of cells across multiple compute nodes, making computationally intensive operations like joins and vector aggregation attainable at the cell level, sans data movement, provided that the join or grouping keys align with the hash-distribution key. 

Furthermore, partitions play a crucial role in data pruning, selectively optimizing data for range or equality predicates defined over partition keys. This optimization is employed only when relevant to the query, ensuring efficiency. 

A remarkable feature is the physical grouping of cells in storage as long as they can be efficiently accessed (diagonal green and blue stripes cells in the image above), allowing queries to selectively reference entire cell dimensions or even individual cells based on predicates and operation types present in the query, granting unparalleled flexibility and performance. 

The Polaris distributed query processing (DQP) operates precisely at the cell level, regardless of what is within each cell. The data extraction from a cell is seamlessly handled by the single-node query execution (QE) engine, primarily driven by SQL Server, and is extensible for accommodating new data types with ease. 

Flexible assignment of cells to compute

The Polaris engine is resilient to compute failures because of the flexible allocation of cells to compute nodes. When a node failure or topology change occurs (scale up or down), the cells of the lost node can be efficiently re-assigned to the remaining topology. To achieve this flexibility, the system maintains metadata state, which includes the assignment of cells to compute nodes at any given time, in a durable manner outside the compute nodes. This means that the critical information about the cell-to-compute node mapping is stored in reliable and persistent external storage, ensuring its availability even in the face of node failures.

This design enhances overall resilience: by adopting this approach, the Polaris engine can quickly recover from node failures or topology changes, dynamically redistributing cells to healthy compute nodes and ensuring uninterrupted query processing across the entire system.
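
The cell abstraction and the durable cell-to-node mapping described above can be illustrated with a small conceptual Python sketch. The bucket count, partition scheme, and node names are assumptions for illustration and are not Polaris internals.

```python
# Conceptual sketch of Polaris-style cells: each row lands in a cell identified
# by (distribution, partition); cells are then assigned to compute nodes.
# All names and numbers are illustrative assumptions.
from collections import defaultdict

N_DISTRIBUTIONS = 4          # hash buckets used for data alignment
rows = [
    {"order_id": 101, "order_date": "2024-01-05", "amount": 20.0},
    {"order_id": 102, "order_date": "2024-02-11", "amount": 35.5},
    {"order_id": 103, "order_date": "2024-02-19", "amount": 12.0},
]

def cell_of(row):
    distribution = hash(row["order_id"]) % N_DISTRIBUTIONS   # data alignment
    partition = row["order_date"][:7]                        # e.g. "2024-02", enables pruning
    return (distribution, partition)

cells = defaultdict(list)
for row in rows:
    cells[cell_of(row)].append(row)

# Partition pruning: a predicate on the partition key touches only matching cells.
february_cells = {c: r for c, r in cells.items() if c[1] == "2024-02"}

# Durable cell-to-node assignment kept outside the compute nodes.
assignment = {cell: f"node-{i % 2}" for i, cell in enumerate(cells)}

# If node-1 fails, its cells are re-assigned to the surviving topology.
assignment = {cell: ("node-0" if node == "node-1" else node)
              for cell, node in assignment.items()}
print(assignment)
```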

MS Fabric Query

From queries to task DAGs

The Polaris engine follows a two-phased approach for query processing:
1. Compilation using SQL Server Query Optimizer:

In the first phase, the Query Optimizer takes the query and generates all possible logical plans. A logical plan represents different ways the query can be executed without considering the physical implementation details.

 2. Distributed Cost-Optimization:

In the second phase, it enumerates all the physical implementations corresponding to the previously generated logical plans. Each physical implementation represents a specific execution strategy, considering the actual resources available across the distributed system. The goal of this cost-optimization phase is to identify and select the most cost-efficient physical implementation of the logical plan. It then picks one with the least estimated cost and the outcome is a good distributed query plan that takes data movement cost into account. 

A task is the physical execution of an operator defined in this two-phased optimization; together, the tasks and their precedence constraints form a directed acyclic graph (DAG).

A task has three components:  
  1. Inputs – collections of cells for each input’s data partition.
  2. Task template – code to execute on the compute nodes.
  3. Output – a dataset represented as a collection of cells produced by the task; it can be either an intermediate result or the final result returned to the user.

Basically, at run time, a query is transformed into a query task DAG, which consists of a set of tasks with precedence constraints. 
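
As a rough mental model of such a task DAG, the sketch below encodes the three task components listed above as a small Python data structure. The operators, cells, and dependency layout are illustrative assumptions, not Polaris code.

```python
# Minimal sketch of a query compiled into a DAG of tasks with precedence
# constraints. Cell identifiers and task bodies are assumptions.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    inputs: list          # collections of cells for each input partition
    template: str         # code/operator to run on a compute node
    depends_on: list = field(default_factory=list)
    output: list = field(default_factory=list)   # cells produced by the task

scan_orders = Task("scan_orders", inputs=[("d0", "2024-02")], template="SCAN orders")
scan_items  = Task("scan_items",  inputs=[("d0", "2024-02")], template="SCAN items")
join_task   = Task("join", inputs=[], template="HASH JOIN on order_id",
                   depends_on=[scan_orders, scan_items])

# A task can run once all of its predecessors have produced their output cells.
def ready(task: Task) -> bool:
    return all(dep.output for dep in task.depends_on)

scan_orders.output = ["cell-a"]
scan_items.output = ["cell-b"]
print(ready(join_task))   # True once both scans have produced output
```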

Task Orchestration

A novel design element in Polaris is its hierarchical composition of finite state machines, which captures the execution intent. Polaris takes a different approach from conventional directed acyclic graph (DAG) execution frameworks by providing a state machine template that orchestrates the execution.

By using it, Polaris gains a significant advantage in terms of formalizing failure recovery mechanisms. The state machine recorder, which operates as a log, enables the system to observe and replay the execution history. This capability proves invaluable in recovering from failures, as it allows the system to precisely recreate the execution sequence and take corrective actions as needed. 

A query has three aspects: the query DAG, the task templates, and the tasks; each of these is treated as an entity. The execution state of each entity is monitored through an associated state machine, encompassing a finite set of states and state transitions. Each entity’s state is a result of composing the states of the entities from which it is built. By utilizing state machines to track and manage the entities’ states, Polaris gains greater control over its overall execution, promoting better coordination and facilitating the implementation of necessary actions based on the current state. 

States can be:
  1. Simple – used to denote success, failure, or readiness of a task template.
  2. Composite – denotes an instantiated task template or a blocked task template.

A composite state differs from a simple state in that its transition to another state is defined by the result of the execution of its dependencies.

MS Fabric- Task Orchestration

In summary, the hierarchical state machine composition in Polaris ensures a structured representation of execution intent, providing better control over query execution, recovery from failures, and the ability to analyze and replay execution history. 
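
The distinction between simple and composite states can also be sketched in a few lines of Python. This is only a toy model of the idea, not the Polaris implementation; the state names and composition rules are assumptions.

```python
# Toy sketch of the hierarchical state-machine idea: a composite state's
# transition is decided by the states of its dependencies. Purely illustrative.
from dataclasses import dataclass, field

SUCCESS, FAILED, READY, RUNNING = "success", "failed", "ready", "running"

@dataclass
class Entity:
    name: str
    state: str = READY
    children: list = field(default_factory=list)   # dependencies

    def resolve(self) -> str:
        # Simple state: no children, the state stands on its own.
        if not self.children:
            return self.state
        # Composite state: derived from the composition of child states.
        child_states = [c.resolve() for c in self.children]
        if any(s == FAILED for s in child_states):
            return FAILED
        if all(s == SUCCESS for s in child_states):
            return SUCCESS
        return RUNNING

t1 = Entity("task-1", state=SUCCESS)
t2 = Entity("task-2", state=RUNNING)
template = Entity("task-template", children=[t1, t2])
query = Entity("query-dag", children=[template])
print(query.resolve())   # "running" until every leaf task reports success
```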


Service Architecture

MS Fabric - Service Architecture

Polaris Architecture

The Polaris architecture and all services within a pool are stateless. Data is stored remotely and abstracted via data cells, while metadata and transaction log state are off-loaded to centralized services, so two or more pools can share them. Placing the state in centralized services, coupled with a stateless micro-service architecture within each pool, means multiple compute pools can transactionally access the same logical database.

The SQL Server Front End (SQL-FE) is the service responsible for compilation, authorization, authentication, and metadata. 

The Distributed Query Processor (DQP) is responsible for distributed query optimization, distributed query execution, query execution topology management, and workload management (WLM). 

Finally, a Polaris pool consists of a set of compute servers each with a dedicated set of resources (disk, CPU, and memory). Each compute server runs two micro-services: 

  • Execution Service (ES) – responsible for tracking the life span of tasks assigned to a compute container by the DQP.
  • SQL Server instance – used as the backbone for executing the template query for a given task and for holding a cache on top of local SSDs.

The data channel serves a dual purpose: it facilitates the transfer of data between compute servers and also acts as the pipe through which compute servers transmit results to the SQL Frontend (FE). 

Tracking the complete journey of a query is the control flow channel’s responsibility: it tracks the progression of the query from the SQL FE to the DQP and subsequently from the DQP to the Execution Service. 


Auto-Scale

As demand fluctuates, the Polaris engine requests additional computational resources, effectively requesting more containers from the underlying Fabric capacity. This adaptive approach ensures seamless accommodation of workload peaks. Behind the scenes, the engine adeptly redistributes tasks to newly added containers, all the while maintaining the continuity of ongoing tasks. Scaling down is transparent and automatic when workload utilization drops.

Resilience to Node Failures

The Polaris engine is resilient by autonomously recovering from node failures and intelligently redistributing tasks to healthy nodes. This functionality is seamlessly integrated into the hierarchical state machine, as discussed earlier. This mechanism plays a critical role in enabling effective scalability for large queries since the probability of node failure increases with the number of nodes involved.

Hot spot recovery

The Polaris engine manages challenges like hot spots and skewed computations through a feedback loop between the DQP and the Execution Service. This mechanism monitors the lifecycle of execution tasks hosted on nodes. Upon detecting an overloaded compute node, it automatically redistributes a subset of tasks to a less burdened compute node. If this does not resolve the problem, the Polaris engine seamlessly falls back to its auto-scale feature, adding computational resources to relieve the pressure. 

Conclusion:

Separation of state and compute, flexible abstraction of datasets as cells, task inputs defined in terms of cells, and fine-grained orchestration of tasks using state machines together give Fabric Warehouse greater flexibility and scalability.

Delta with V-Order applies write-time optimizations to the Parquet file format: special sorting, row group distribution, dictionary encoding, and compression. Because all data lands in an open file format and bin compaction is handled for you, there is no need to write manual cleanup code. Polaris is cloud-native, supports both big data and relational warehouse workloads, and its stateless architecture provides flexibility and scalability.

Though some of the functions from the dedicated SQL pool (DW) are missing, we feel Fabric Data Warehouse is promising. We are working on a benchmark comparison in our next blogs.

Disclaimer: Some of the content presented in this blog is drawn from the original research paper (PVLDB reference format: Josep Aguilar-Saborit, Raghu Ramakrishnan, et al., VLDB Conferences, Microsoft Corp.). We have added our own comments and views.

Thinking of Data Democratization? Microsoft Fabric Will Help You Adopt Data Mesh Culture

Introduction

Like hundreds of leading enterprises, you have probably realized that treating data as assets closely managed by a highly specialized central team creates a huge bottleneck for actual data owners and consumers.

Data Mesh Culture 

If you are still continuing with a centralized data culture, you might be feeling constrained trying to scale appropriately to accommodate a huge influx of data volume, diversity, and demand for data-driven insight.

If you have reached here from a Google search, you are probably already seriously contemplating a transition to a culture where data is managed within the domain that owns it, and domains are responsible for sharing data with others inside or outside the organization.

Why is monolithic data architecture no longer supporting big enterprises?

Monolithic data architectures, like data warehouses and data lakes, were designed around the concept of storing and serving an organization’s vast amount of operational data in one centralized location. The thought process was that a specialized data team would ensure the data is clean, accurate, and properly formatted, and that data consumers would be served high-quality contextual data for a wide range of analytical use cases. However, in reality, this is not always the case. While centralization strategies have seen initial success for organizations with smaller data volumes and fewer data consumers, bottlenecks started to develop as data sources and consumers increased.

As enterprises grow, their data requirements become more complex. Monolithic architectures are often difficult to scale horizontally to meet increased data volumes and processing demands, which can lead to performance bottlenecks and limit an organization’s ability to handle big data effectively.

data architecture

Over 80% of enterprise data remains Dark Data: data that does not provide the organization with any insight for business decisions.

Modern enterprises deal with a wide variety of data types, including structured, semi-structured, and unstructured data. Monolithic architectures are typically optimized for handling structured data, making it challenging to efficiently process and analyze diverse data sources. Different business units of an enterprise might have completely different data needs in terms of sources, data types, and processing logic. Accommodating all of these diverse requests has become genuinely challenging for a central team, both in terms of domain knowledge and the technology involved. This results in mounting frustration among data owners and consumers, and much of the data may never even be referred to the central data team. As a result, a lion’s share of enterprise data remains unexplored, referred to as Dark Data.

We can summarize the challenges of enterprises having a monolithic data architecture as below:

  • Disconnects data and data owners (product/service experts).
  • Data processing architecture is not in alignment with the business axis of change.
  • Tight coupling between stages of the data processing pipeline impacts the flexibility that business needs.
  • Creates highly specialized and isolated engineering teams.
  • Creates backlogs focused on technical, not business functional, changes.

How is Data Mesh, i.e., domain-centric federated data management, helping big enterprises?

Data Mesh is founded on four principles:
Domain Oriented Ownership

Data is owned and managed by the teams or business units (domains) that generate and use it. This aligns data responsibility with the domain’s expertise, making it more manageable and relevant to its specific context.

Data as a Product

Data is treated as a product rather than a byproduct of business processes. Each domain is responsible for creating and maintaining its own data as a product, which is designed to meet the needs of the domain’s consumers. Data products include well-defined data sets, APIs, and documentation.

Self-serve Data Platform

Development of self-serve data infrastructure that enables easy and secure access to data for data consumers. This infrastructure includes data catalogs, data discovery tools, and standardized interfaces for accessing data products.

Federated Computational Governance

A structured approach to governance within a data mesh. It ensures that domains maintain a degree of independence while adhering to essential governance standards, fostering interoperability, and enabling the organization to leverage the full potential of its data ecosystem. This approach relies on the fine-grained codification and automation of policies to streamline governance activities and reduce overhead.

The basic concept of Data Mesh is simple – help data owners like functional or business units, manage their own data as they understand their data best. Removing dependency on a central data team, domains can enjoy autonomy and can scale as needed.

Along with the independence comes accountability as a single team is responsible for the data from production to consumption. This encourages domain teams to take responsibility for the quality, accuracy, and accessibility of their data. This, in turn, can lead to better data governance and more reliable data.

As domains take ownership of data, they ensure that potential consumers of their data, be it other domains within the organization or external consumers, can easily discover, trust, and access the data. They also ensure that published datasets follow certain standards in terms of schema and metadata so that the data can interoperate with other datasets. This is where the product mindset becomes relevant, and data is managed and published as a product. This makes it easier for data consumers to find the data they need, fostering self-service analytics and reducing the time and effort required to locate relevant information.

Domain-centric architectures are designed to scale with an enterprise’s growing data needs. When new domains or data sources are added, they can be integrated without significantly affecting existing domains. This flexibility allows organizations to adapt to changing business requirements and incorporate new data sources and technologies more easily.

Data silos are a common problem in large enterprises. A domain-centric approach helps break down these silos by promoting collaboration and data sharing between different parts of the organization. Domains can act as data product teams, providing standardized, well-documented data interfaces for others to use.

Domain-centric architectures encourage smaller, more focused teams to develop and manage data products. This can lead to quicker development cycles, faster iterations, and the ability to innovate more rapidly. It also reduces the risk of bottlenecks in data delivery. Also, because of the ownership, teams are more likely to ensure data quality and consistency within their specific domains. This can result in better data reliability and trustworthiness across the enterprise.

What kind of challenges enterprises face while adopting domain centric data management?

Analyzing the experiences of several enterprises, the challenges companies face in adopting a domain-centric federated model can be classified mainly under three categories:

  • Management acceptance
  • Dealing with cultural shift
  • Governance challenges

We need to accept that this is a significant change that needs support from top management in pushing the change through the ranks of the organization. The main challenge is structural. For decades, industries have been habituated to dealing with data in a centralized manner, and organizational roles and responsibilities were defined accordingly. Changing over to a data federation model threatens to significantly alter the scope of those roles. Along with that, domains need to be equipped with the skills, infrastructure, and controls to perform data processing and management on their own.

While each domain has autonomy, there’s a need for consistent data governance across domains. Coordinating and enforcing data governance policies at the enterprise level while allowing autonomy at the domain level can be a delicate balance. This applies to both consistent data quality and interoperability of data produced by different domains. Ensuring that each domain adheres to the enterprise’s data quality standards while allowing flexibility for domain-specific requirements is a delicate task. Similarly, ensuring each domain adhere to similar data formats, schemas, or data processing technologies is crucial.

To overcome these challenges, a dedicated team empowered by top management’s commitment needs to work in phases with different domains, beginning with a few pilot transitions. Doing pilot programs with the most willing groups increases the chance of early success and helps win the trust of others across the organization.


Bonus Read: Data Fabric vs. Data Mesh

Microsoft Fabric will help ease transition challenges

Microsoft Fabric has been designed to support organizations adopt the domain centric data culture in a streamlined manner. As Data Mesh is a socio-technical endeavor, organizations need to gear themselves up for the change, while Fabric can largely address the technical aspects.

Microsoft Fabric consolidates many technologies under one platform to offer end to end data management with great flexibility to accommodate diversity in organizational culture, structure, and processes. Four principles of Data Mesh can be mapped to one or more of Fabric components.

Data Mesh Workflow

Data Mesh architecture empowers business units to structure their data as per their specific needs. Organizations can define data boundaries and consumption patterns as per business needs. Fabric allows organizations to define domains that map to business units and to associate workspaces with each domain. Data artifacts like lakehouses, pipelines, notebooks, etc. can be created within workspaces. Federated governance can be applied at a granular level through domains and workspaces.

Fabric’s design also implements a framework to support datasets as products, which is a major recommendation of Data Mesh. Any dataset promoted from a workspace is listed by OneLake as an available dataset and becomes discoverable. A listed dataset is published with metadata that gives consumers enough detail about the dataset. The workspace owner can also mark a dataset as certified, which makes the dataset trustworthy for consumers. A dataset can be published through multiple endpoints to facilitate access through native tools and methods, and the specific addresses for these endpoints are also published through OneLake. This satisfies data product characteristics such as being addressable and natively accessible. Microsoft Purview is now integrated with Fabric to further bolster data discovery. So, using Fabric, organizations get the right technology support to package their datasets as products when sharing inside or outside the organization.

Fabric makes it easy for data owners to perform most data processing activities themselves using the OneLake interface. Very little complex data processing and transformation skill is needed to perform such activities in Fabric. This enables a large section of business users, who are the actual data owners and/or consumers, to self-serve their data requirements.

In terms of Federated Governance, another principle of Data Mesh, Fabric is making significant strides. Microsoft Purview is a unified data governance service that is integrated with Fabric to enable a foundational understanding of your OneLake. Purview helps with automatic data discovery, sensitive data classification, and end-to-end data lineage, and enables data consumers to access valuable, trustworthy data.


What is your next step towards data democratization?

By now, you must agree that data democratization is a must for your organization if you envision continued growth. You do not want a severe bottleneck in the form of a centralized data architecture supported by a highly specialized team. You want to provide your business units more data autonomy so they can scale as business demands.

You also have understood that the transition is not easy, as this is a techno-socio change. You need the support of a good technology platform as well as a good partner with experience driving similar transitions with other enterprises. Microsoft Fabric has great promise to be one such platform. You may refer to the eBook, Deliver Your Data as Product for Consumers Using Microsoft Fabric, for detailed transition steps, challenges, and the Fabric features that help you implement Data Mesh principles in your organization.

For a workshop on your prospect of transition to Data Democratization, please contact Netwoven Inc.

Microsoft Fabric vs Synapse: A Comparative Study

Introduction:

Microsoft Fabric, unveiled by Microsoft, stands as a unified Software as a Service (SaaS) platform intricately linked with existing Microsoft analytical tools such as Power BI, Azure Data Explorer, Azure Synapse Analytics and Azure Data Factory. It delivers numerous benefits across seamless data discovery, streamlined integration time and data exchange.

Fabric’s data and operations find a home in OneLake, a unified, singular, logical data lake catering to the entire enterprise, embracing the concept of “data as a product” to support Data Mesh architecture. OneLake serves as a multi-cloud data lake, enabling enterprises to virtualize data lake storage across AWS S3, ADLS Gen2, and Google Storage, seamlessly integrating with Oracle OFS, SAP and other CRMs. Existing data lakes can be seamlessly incorporated within Fabric.

Within the Fabric ecosystem, all data is meticulously preserved in delta lake format, including warehouse data. Traditional relational storage is phased out, eliminating the need for maintaining separate data sets for data warehousing, data lakes, real-time analytics and business intelligence. Instead, all workloads can draw directly from a unified data repository within OneLake.

Microsoft has integrated Fabric Copilot into the Fabric environment, enhancing functionalities across dataflows, data engineering, and Power BI. Fabric consolidates nearly all Synapse functionalities, offering enhanced performance and integrated services within a single platform, promising a better return on investment as depicted in the architecture mapping image below.

Azure Synapse vs Fabric


Synapse vs Fabric: How do they differ from one another?

  • Fabric operates as a Software as a Service (SaaS), offering less control and a fully managed service, while Synapse functions as a Platform as a Service (PaaS), providing more control and a higher level of responsibility.
  • The Fabric warehouse operates on the Polaris engine, which supersedes the SQL distributed engine and also drives the Serverless SQL pool in Synapse. This massive parallel processing (MPP) engine automatically scales to accommodate various data workloads.
  • Dataflows Gen2 (Power Query) can serve as an alternative to Mapping Data Flows. Synapse Mapping Data Flows offer a graphical user interface for “no code / low code” data transformation within Synapse, a feature not supported in Fabric.
  • The use of OPENROWSET() is not supported. However, querying data from the lakehouse in Fabric is still possible using T-SQL through the SQL analytics endpoint; all queries that employ OPENROWSET syntax need to be modified (a hedged example follows this list).
  • While the Synapse Link feature is absent in Fabric, it is substituted with Mirroring. Fabric introduces the Data Activator module, designed for real-time data detection and monitoring. It can send notifications and execute actions upon detecting specific data patterns.
  • The Spark engine in Fabric spins up significantly faster than in Synapse, starting in 2 to 4 seconds versus roughly 2 minutes, and is a natively integrated part of Fabric.
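
To illustrate the OPENROWSET point from the list above, the hedged sketch below contrasts the Synapse serverless pattern with querying the same data through a Fabric SQL analytics endpoint from Python. The server, database, and table names are placeholders, and authentication details will vary by tenant.

```python
# Hedged sketch: querying lakehouse data without OPENROWSET.
# Server, database, and table names are placeholders; authentication details
# depend on your tenant and are simplified here.
import pyodbc

# Synapse serverless style (for contrast) queried files directly, e.g.:
#   SELECT * FROM OPENROWSET(
#       BULK 'https://<storage>.dfs.core.windows.net/data/sales/*.parquet',
#       FORMAT = 'PARQUET') AS rows;
# Fabric's SQL analytics endpoint exposes lakehouse tables instead:

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-sql-endpoint>.datawarehouse.fabric.microsoft.com;"   # placeholder
    "Database=<your_lakehouse>;"                                       # placeholder
    "Authentication=ActiveDirectoryInteractive;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 order_id, amount FROM dbo.sales")   # assumed table
for row in cursor.fetchall():
    print(row)
```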


Conclusion

Fabric enables next-generation data and analytics capabilities such as Lakehouse, Data Mesh, and Data Virtualization. However, keep in mind that everything announced is still in preview. Organizations that utilize Azure Synapse Analytics should consider Microsoft Fabric and how it fits with their technological roadmap. Certain Synapse features (most notably, Mapping Data Flows and OPENROWSET() syntax in SQL queries over files in a data lake) are not supported in Fabric. As such, there is no simple “lift-and-shift” path from Synapse to Fabric.

Data Mesh vs. Data Fabric – Navigating Modern Data Architecture

Introduction:

In today’s data-driven world, organizations are continually seeking innovative approaches to manage and harness the power of their data. Two emerging concepts that have garnered significant attention are Data Fabric and Data Mesh. Both aim to address the challenges of handling vast amounts of data, but they do so in distinct ways. In this blog, we’ll explore the key principles, benefits, and considerations of Data Fabric vs Data Mesh to help you make informed decisions about your data architecture strategy. 

What is Data Mesh

  • Domain-oriented decentralized data ownership and architecture​
  • Data as a product​
  • Self-serve data infrastructure as a platform​
  • Federated computational governance​

What is Data Fabric

  • Augmented Data Catalog
  • Persistence Layer
  • Knowledge Graph
  • Insights and Recommendations Engine
  • Data Preparation and Data Delivery Layer
  • Orchestration and DataOps
  • Automation (AI/ML)
  • APIs

Data Fabric vs Data Mesh: What’s the difference?

  • The difference between the two concepts lies in how users access data. ​ 
  • Data fabric and data mesh both provide architecture to access data across multiple technologies and platforms, but a data fabric is technology-centric, while a data mesh focuses on organizational change.
  • Data fabric represents a comprehensive integrated architectural layer facilitating the connection between data and analytical processes. It utilizes pre-existing metadata assets to facilitate the design, deployment, and effective management of data. The objective of data fabric is to expedite the derivation of insights from data via automated processes and offer real-time analytics. It seamlessly merges data and analytics, aiming to provide timely insights, and acts as a management solution enabling seamless access within a distributed environment. AI/ML is built in.
  • Data Mesh is a highly decentralized data architecture equipped to address challenges such as lack of data ownership, lack of quality data, and scaling bottlenecks. The goal of data mesh is to treat data as a product, with each source having a data product owner who could be part of a cross-functional team of data engineers. Data Mesh, introduced by Zhamak Dehghani of Thoughtworks in May 2019, overcomes the problems of traditional data lakes and data warehouses.
  • Data Fabric fits under Data Mesh very well.  

 Approach:

Automation vs Human Inclusion 
  • Data Mesh adopts a people- and process-centric approach towards data, treating it as a product.
  • Data fabric harnesses both human and machine capabilities to access data in its original location or facilitate consolidation when necessary. It integrates technologies to link various data sources, types, and locations, employing diverse methods for data access.
  • Gartner uses the analogy of a self-driving car to illustrate this concept:  
    • Data fabric acts as a passive observer, monitoring data pipelines and suggesting more efficient alternatives. As both the data “driver” and machine learning algorithms become accustomed to recurring scenarios, they complement each other by automating routine tasks, allowing leaders to focus on innovation. 
Data storage: Centralized vs Decentralized. 
Data access: APIs vs controlled datasets
  • Within Data Mesh, data is accessible through controlled datasets. Initially, data is extracted from departmental data stores and consolidated into a unified location.
  • In Data Fabric, data is accessed through purpose-driven APIs. Data is segregated into tailored datasets for particular use cases, with the respective business unit maintaining control over the data.
  • Data fabric continuously identifies, connects, and enriches real-time data from different applications to discover relationships between data points. It does so by building a graph storing interlinked data descriptions that algorithms can use for business analytics. 

Data Fabric Vs. Data Mesh: Main Differences

Architecture
  • Data Fabric: Data is centralized and made available through APIs. Aims to eliminate human effort with machine learning and AI.
  • Data Mesh: Data is stored within each domain of a company and copied into specific datasets for specific use cases. Less emphasis on AI, since work is handled by domain experts.

Benefits
  • Data Fabric: Self-service data consumption and collaboration. Automates governance, data protection, and security. Automates data integration and data engineering.
  • Data Mesh: Agility and scalability with fast access and accurate data delivery. Platform connectivity and data security. Robust data governance and end-to-end compliance.

Use Cases
  • Data Fabric: Business applications – challenges of data availability and reliance for business applications. Data discovery – what data is available and where. Machine learning – minimizes the data preparation phase when training ML models.
  • Data Mesh: Financial sector – fast fraud threat analysis without copying data to a central database. Sales and marketing – targeted campaigns based on user profiles. Machine learning – create virtual data warehouses as a basis for training ML models.


It’s essential to recognize that data mesh and data fabric are not mutually exclusive concepts. Organizations have the opportunity to utilize both approaches in various use cases.
  • Data Mesh is ideal for hybrid cloud networks
  • Data fabric enables single-point data access, addresses data quality and storage issues, and handles security threats.
  • The distinction between the two concepts lies in the way users retrieve data. 
  • Data fabric and data mesh offer architectural solutions for accessing data across diverse technologies and platforms. However, a data fabric centers around technology, whereas a data mesh prioritizes organizational transformation. 
  • Data mesh emphasizes people and processes over architecture, whereas a data fabric is an architectural approach that adeptly addresses the intricacies of data and metadata in a cohesive manner. 

Architecture

Data Fabric vs. Data Mesh – Navigating Modern Data Architecture

Conclusion on Data Fabrics vs Data Mesh 

  • To summarize, both data fabric and data mesh provide powerful solutions to make your organization data-driven and even data-led. Data fabric allows everyone (with the right permissions) easy access to data at the right time. Data mesh takes a decentralized approach by keeping separate domain-specific datasets.
  • Choosing one over the other essentially boils down to the problem your organization is dealing with. 

Design A Consumer Centric Data Architecture with Microsoft Fabric Lakehouse

Data is produced for consumption! 

Any data that is ever produced has the sole objective of being consumed by some person or application at the right time, to trigger an action, help with decision making, or serve some other purpose. Data characteristics like format, type, structure, schema, etc. may vary depending on the objectives of the consumer. That’s why understanding consumer needs is important, so that data can be processed, packaged, and served as per the specific needs of the consumer.

What does it take to build a consumer-centric data model?

To facilitate a consumer-centric data culture, the data architecture should support a few key principles:

Personalization:

The architecture focuses on tailoring data experiences to individual users or groups. This might involve recommending relevant data, providing personalized dashboards, or offering data in formats that suit each consumer’s preferences. 

Ease of Use:

The architecture emphasizes simplicity and user-friendliness. It ensures that consumers can easily access, search, and understand the data they need without requiring advanced technical skills. 

Data Accessibility:

Consumer-centric architecture ensures that data is easily accessible to authorized users. It might involve implementing secure authentication and authorization mechanisms to control who can access specific data sets. 

Data Quality:

High data quality is a priority to ensure that the information available to consumers is accurate, consistent, and reliable. Data validation, cleansing, and enrichment processes are often part of this architecture. 

Real-time or Near-real-time Access:

Depending on the needs of the consumers, the architecture might provide real-time or near-real-time access to data. This is particularly important for time-sensitive decision-making. 

Scalability and Performance:

The architecture should be able to handle growing amounts of data and increasing numbers of users without sacrificing performance. Scalability ensures that consumers can interact with the data smoothly even as the system grows. 

Integration:

Consumer-centric architecture often involves integrating data from various sources to provide a comprehensive view. Integration might include both internal and external data sources. 

Feedback Loop:

The architecture should allow consumers to provide feedback on data quality, relevance, and usability. This feedback loop helps improve the architecture over time. 

Data Governance:

Consumer-centric architecture includes mechanisms for data governance, including data lineage, data cataloging, and ensuring compliance with data privacy regulations. 

Analytics and Insights:

The architecture should support data analytics and reporting, allowing consumers to gain insights from the data and make informed decisions. 

Flexibility:

The architecture should be adaptable to changing consumer needs and evolving data requirements. It might involve using technologies like APIs to enable easy integration with various tools and applications.

Overall, consumer-centric data architecture seeks to create an environment where data is not just stored and managed but is also effectively utilized to meet the needs of users, whether they are business analysts, executives, customers, or any other stakeholders relying on data-driven insights. 


Modern data lakehouses using medallion architecture to address consumer needs 

Design A Consumer Centric Data Architecture with Microsoft Fabric Lakehouse

The Medallion Architecture has gained popularity due to its ability to handle large volumes of data while providing scalability, flexibility, and performance. It is a logical data design pattern that facilitates progressive improvement in data quality, refinement, and preparation as data flows through distinctly identifiable layers, commonly named Bronze, Silver, and Gold; hence the word Medallion. The number of layers, however, is not set in stone. A situation may demand additional layers at the beginning or end. 

This architecture, however, is not a replacement for dimensional modeling patterns. You can design schemas and tables in each layer as per the data usage objectives.  

Bronze Layer:

This is the raw data layer where data from various sources is ingested without much processing. It is like the initial stage of landing data in a data lake. 

Silver Layer:

In this layer, data is refined, transformed, and cleaned. This is where data is enriched, standardized, and made ready for analysis. It is a layer between raw data and fully refined data. 

Gold Layer:

This is fully refined and curated data that is ready for consumption by analytics and reporting tools. It is the layer where data is optimized for query performance and is typically used for business insights. 

We will be discussing considerations and data planning strategies for each layer later in this article. 

Microsoft Fabric Lakehouse has all the right elements for medallion architecture

Microsoft Fabric introduced the concept of one data lake (OneLake) for the entire organization and one copy of data for use with multiple analytical engines. Within OneLake, data ownership is established with workspaces. Data can be copied or streamed into a lakehouse in any workspace from any of the diverse sources. Alternatively, shortcuts can be used to access data from internal or external sources within a lakehouse without copying it.

Design A Consumer Centric Data Architecture with Microsoft Fabric Lakehouse

Data items, structured or unstructured, are always stored in Delta Parquet format. This, along with the Delta Lake implementation in Fabric, ensures that consumers with different personas and distinct perspectives can access the same data item from different compute engines without ever having to create copies of it. This simplifies data governance.

Design A Consumer Centric Data Architecture with Microsoft Fabric Lakehouse

A review of the Fabric lakehouse architecture makes it clear that its components align well with the objectives of the medallion architecture.  

Raw data can be ingested into one lakehouse if all of the data belongs to a single region. If there is a need for internal data protection or boundaries, data can be stored across multiple bronze lakehouses. 

Raw data can be processed for transformation, cleansing, enrichment, etc. using Notebooks or Dataflows that have access to all bronze lakehouse data. Data in this layer is typically not aggregated.

Design A Consumer Centric Data Architecture with Microsoft Fabric Lakehouse

Transformed data can be saved in other dedicated lakehouses or, better still, in one or more separate workspaces termed the silver layer. You should consider multiple workspaces in the silver layer if different business groups are responsible for the cost of processing their own data.

Gold data is typically consumed directly by business groups. Each group has its own expectations of the data, mapped to its business objectives. Because objectives may vary, it is best to isolate data for these target groups in separate silos. In Fabric, the gold layer may consist of one or more workspaces. Typically, a one-to-many relationship works best between silver and gold datasets. Using Notebooks or Dataflows in the silver workspace, data can be aggregated into its final curated form for ready consumption in Power BI or other analytical applications.  


Native support for consumer expectation from the architecture

We can seamlessly map most of the consumer-centric data architecture principles described earlier in this article onto the Fabric lakehouse architecture described above. Presenting data through schemas relevant to target user groups, in isolated workspaces or lakehouses, ensures delivery of personalized data to users. Transformation from raw data to schemas suitable for different target groups happens in parallel, with strong performance guarantees from the underlying Spark cluster. The same Spark cluster allows the architecture to deliver transformations at scale. The Delta Lake configuration ensures that user groups with varied analytical tools and applications can work on the same data set, since Fabric supports both a SQL interface and a Lakehouse interface. Fast processing and an interface of choice ensure near-real-time access to processed data in the preferred format. Schema validations at the silver and gold lakehouses guarantee high data quality and trigger feedback loops for data quality issues originating at the source or during processing.  

Conclusion

To summarize, Microsoft Fabric comes with a great promise to materialize the vision of serving consumers with data as a customized product that can be used directly to achieve desired objectives. Watch out for an eBook where we will walk through the steps of creating this architecture for different use cases with sample datasets. 

A Microsoft Fabric review and why it is a big deal for Enterprise Analytics  https://netwoven.com/data-engineering-and-analytics/microsoft-fabric-and-why-it-is-a-big-deal-for-enterprise-analytics/ https://netwoven.com/data-engineering-and-analytics/microsoft-fabric-and-why-it-is-a-big-deal-for-enterprise-analytics/#respond Thu, 27 Jul 2023 09:00:12 +0000 https://netwoven.com/?p=47197 What is Microsoft Fabric?  Microsoft Fabric establishes a unified, intelligent data foundation for all analytics workloads, seamlessly integrating Power BI, Data Factory, and the next generation of Synapse. This cohesive… Continue reading A Microsoft Fabric review and why it is a big deal for Enterprise Analytics 

What is Microsoft Fabric? 

Microsoft Fabric establishes a unified, intelligent data foundation for all analytics workloads, seamlessly integrating Power BI, Data Factory, and the next generation of Synapse. This cohesive approach offers customers a streamlined and modern analytics solution, reducing the complexities of management. 

What is Microsoft Fabric and Why It Is A Big Deal for Enterprise Analytics

While catering to various user personas, such as data integration engineers, data scientists, and BI professionals, each experience harmoniously coexists within a single SaaS product, minimizing integration needs and fostering enhanced collaboration.

Beyond providing an intuitive user interface, Fabric experiences share a common foundation, including Microsoft OneLake, ensuring data integrity, breaking down silos, and harnessing AI capabilities to boost productivity and insights discovery. Moreover, Fabric prioritizes security and governance, embedding industry-leading capabilities as standard features. 

As data accessibility expands, Fabric empowers users with visibility into tenant activities, insights into usage and adoption, and essential tools for end-to-end data security and governance. With built-in enterprise-grade governance and compliance features, powered by Microsoft Purview, Fabric ensures robust data management practices.

After 25 years of working in data and the last 7 in AI, I am a great supporter of data observability, which I see as a foundation of Microsoft Fabric OneLake. Compared to their previous generation of offerings, it looks like this time Microsoft got its act together and put some meaningful thought into resolving customer problems in a way that may pass the test of time and that I could recommend or use to help clients.

What is Microsoft Fabric and Why It Is A Big Deal for Enterprise Analytics

Behind the scenes, OneLake uses Azure Data Lake Storage (ADLS) Gen2, but there is an application wrapper on top of it that handles various tasks, such as management (you don’t have to provision storage accounts or external tables), security (Power BI security is enforced), and governance. OneLake is tightly coupled with the Power BI catalog and security. In fact, you can’t create folders outside the Power BI catalog, so proper catalog planning is essential. When you create a Fabric workspace in Power BI, Microsoft provisions an empty blob container in OneLake for that workspace. As you provision additional Fabric services, Fabric creates more folders and saves data in them. For example, we can use Azure Storage Explorer to connect to a Fabric workspace and see the DW and Lakehouse folders. Microsoft has also added two additional system folders for data staging. 
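
As a quick illustration of that ADLS-compatible access, here is a minimal PowerShell sketch using the Az.Storage module against the OneLake endpoint. Treat it as a sketch: the workspace and lakehouse names are placeholders, and the exact folder paths depend on your tenant.

# Sign in with an account that has access to the Fabric workspace
Connect-AzAccount | Out-Null

# OneLake exposes an ADLS Gen2-compatible endpoint under the 'onelake' account name
$ctx = New-AzStorageContext -StorageAccountName "onelake" -UseConnectedAccount -Endpoint "fabric.microsoft.com"

# List the files under a lakehouse in a workspace (placeholder names)
Get-AzDataLakeGen2ChildItem -Context $ctx -FileSystem "MyWorkspace" -Path "MyLakehouse.Lakehouse/Files" |
    Select-Object Path, IsDirectory, Length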

What is Microsoft Fabric and Why It Is A Big Deal for Enterprise Analytics

What does Microsoft Fabric mean for Power BI users? 

Unveiling Copilot Integration in Power BI! 

What is Microsoft Fabric and Why It Is A Big Deal for Enterprise Analytics

Say hello to Copilot in Power BI! With Copilot seamlessly integrated into Power BI, we’re ushering in the next generation of AI to empower users to extract maximum value from their data effortlessly. By leveraging Copilot, users can simply articulate their desired visuals and insights, leaving the heavy lifting to Copilot. From crafting reports and refining DAX calculations to generating narrative summaries and posing data-related queries, all can be achieved using conversational language. Moreover, with the flexibility to tailor the tone, scope, and style of narratives, and seamlessly integrate them into reports, Power BI enhances data insights delivery through easily comprehensible text summaries, ensuring impactful communication of insights.

We’ve recently introduced quick measure suggestions for DAX, enabling analysts to swiftly generate the necessary code. The remaining features of Copilot in Power BI are currently available in private preview. Keep an eye on the Power BI blog for updates and the public release date.


Introducing a Unified Data Foundation with OneLake and Direct Lake Mode!

Power BI is embracing open data formats by adopting Delta Lake and Parquet as its native storage format, aiming to prevent vendor lock-in and minimize data duplication and management efforts. Direct Lake mode offers exceptional performance directly from OneLake, eliminating data movement. This, coupled with the capability for other analytical engines to access and modify data within the lake, will revolutionize how business users interact with big data. Power BI datasets in Direct Lake mode deliver query performance comparable to import mode, with the real-time aspect of DirectQuery. Plus, data remains within the lake, eliminating the need for refresh management.  

We’re excited to announce the preview of Direct Lake mode for Power BI datasets on Lakehouses, with plans to extend the preview to datasets on Data Warehouses soon. While Direct Lake mode datasets for Warehouses are currently in private preview, they are accessible if you use the SQL Endpoint for Lakehouse.

To experience Direct Lake mode from your Lakehouse or Warehouse in Fabric, simply click on “New Power BI Dataset,” select the desired tables, and confirm your selection. Proceed to the data model to establish measures and relationships, then create visually stunning Power BI reports. Notice the seamless integration from lake data to report creation, all within the browser and without the need for a refresh.

What is Microsoft Fabric and Why It Is A Big Deal for Enterprise Analytics


Enhancing Enterprise Collaboration with Git Integration for Power BI Datasets and Reports 

We’re streamlining collaboration between development teams and Power BI content with our Git integration feature. By connecting your workspace to Azure DevOps repositories, you can effortlessly track changes, revert to previous versions, and consolidate updates from multiple team members into a single source of truth, synced into the workspace with just a click.

For developers, this integration offers a range of benefits: 
  • Author report and dataset metadata files in source-control friendly formats directly within Power BI Desktop.
  • Save projects as Power BI projects (.PBIP) in folders rather than .PBIX files.
  • Facilitate multi-developer collaboration, integrate with source control for version history tracking, compare revisions, and revert to prior versions.
  • Employ continuous integration and continuous delivery (CI/CD) to enforce quality checks before reaching production environments.
  • Conduct code reviews, automated testing, and automated builds to validate deployment integrity.

Users can leverage Git integration and deployment pipelines for end-to-end application lifecycle management, developing through Git integration and deploying Power BI content across development, testing, and production workspaces. Developers can opt for the UI experience or automate processes using tools like Azure Pipelines.

End-to-End Governance Across Fabric 

Fabric unifies Power BI, Synapse, and Data Factory on a single SaaS platform, enabling data teams to collaborate in a shared workspace with centralized administration, governance, and compliance tools. This includes features such as data lineage and impact analysis, data protection with sensitivity labels, data endorsement, admin monitoring, and more. The unified experiences facilitate seamless navigation between tools and collaboration among team members. Additionally, the OneLake Data Hub facilitates efficient discovery and management of data across the organization, empowering users to explore and build upon available data relevant to their business domain.

Simplified Resource Management with Universal Compute Capacities

Fabric simplifies resource management by offering a single pool of compute that powers all Fabric experiences. This approach enables customers to freely leverage all workloads without friction in their experience or commerce. The universal compute capacities reduce costs significantly, as any unused compute capacity in one workload can be utilized by others. For Power BI Premium customers, existing Power BI Premium P SKUs will automatically support all Fabric experiences. Starting June 1, new Fabric SKUs will be available for purchase in the Azure portal, granting access to these enhanced experiences.

How Microsoft Fabric Helps Implement Data Observability in Your Organization https://netwoven.com/data-engineering-and-analytics/how-microsoft-fabric-helps-implement-data-observability-in-your-organization/ https://netwoven.com/data-engineering-and-analytics/how-microsoft-fabric-helps-implement-data-observability-in-your-organization/#respond Wed, 19 Jul 2023 13:23:01 +0000 https://netwoven.com/?p=47131 What is Data Observability Data observability empowers organizations to grasp the vitality of their data systems, utilizing DevOps Observability principles to eradicate data downtime. Through automated monitoring, alerting, and triaging,… Continue reading How Microsoft Fabric Helps Implement Data Observability in Your Organization

What is Data Observability

Data observability empowers organizations to grasp the vitality of their data systems, utilizing DevOps Observability principles to eradicate data downtime. Through automated monitoring, alerting, and triaging, it identifies and assesses data quality and discoverability concerns, fostering robust data pipelines, enhancing team productivity, and ensuring satisfaction among data consumers.

Five Pillars of Data Observability 

Data Freshness

Assessing the timeliness and update frequency of your data tables is crucial for informed decision-making. Outdated data equals wasted time and money.

Data Distribution

Understanding the range of possible values in your data provides insights into its reliability. Knowing what to expect from your data helps establish trustworthiness. 

Data Volume

The completeness of your data tables reflects the health of your data sources. Sudden changes in volume alert you to potential issues. 

Data Schema

Changes in data organization can indicate data integrity issues. Tracking schema modifications and their timing is key to maintaining a healthy data ecosystem.

Data Lineage

Pinpointing where data breaks occur is essential. Lineage reveals the upstream sources and downstream users affected, along with metadata essential for governance and consistency. 

Data observability is the ability of an organization to have broad visibility of its data landscape and multilayer data dependencies. It helps bridge the gaps in data governance, ensuring a well-rounded, comprehensive, and contextual approach to resolving bottlenecks and driving results.

Key Components of Data Observability

Rapid Integration

Does it seamlessly integrate with your existing infrastructure without the need for pipeline modifications, coding, or specific programming languages? A swift and smooth connection ensures quicker benefits and enhanced testing coverage without substantial investments.

Security-Centric Architecture

Does it monitor data in its current storage location without necessitating data extraction? Such a solution scales across your data platform, ensuring cost-effectiveness and compliance with stringent security standards.

Simplified Setup

Does it require minimal configuration and virtually no threshold adjustments to get started? An effective data observability platform employs machine learning models to adapt to your environment and data automatically. By utilizing anomaly detection techniques, it reduces false positives by considering the broader context of your data, minimizing the need for extensive rule configuration and maintenance. Moreover, it allows you to define custom rules for critical pipelines directly within your CI/CD workflow, offering both simplicity and flexibility.


Core foundations 

How Microsoft Fabric Helps Implement Data Observability in Your Organization

Conclusion

In summary, implementing data observability is crucial for organizations aiming to maximize the value and reliability of their data systems. By embracing the five pillars of data observability—data freshness, distribution, volume, schema, and lineage—and utilizing key components such as rapid integration, security-centric architecture, and simplified setup, organizations can achieve greater insights, productivity, and satisfaction among data consumers.

Microsoft Fabric, specifically OneLake, serves as a robust supporter of the data observability framework, offering core elements that benefit customers immensely. Its emphasis on time-to-value, security-first architecture, and minimal configuration aligns perfectly with the needs of modern organizations. By seamlessly integrating with existing stacks, ensuring data security at rest, and requiring minimal setup and configuration, Microsoft Fabric simplifies the path to achieving comprehensive data observability. With its inclusive and integrated approach, facilitated by one-tenant deployment, Microsoft Fabric emerges as a reliable partner in fostering a data-driven culture within organizations, ultimately driving success in today’s data-centric landscape.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases https://netwoven.com/data-and-analytics/consolidating-azure-sql-vulnerability-scan-reports-across-databases/ https://netwoven.com/data-and-analytics/consolidating-azure-sql-vulnerability-scan-reports-across-databases/#respond Wed, 01 Jul 2020 07:09:30 +0000 https://www.netwoven.com/?p=34345 As business today is increasingly advancing its online presence, the associated threats for security breaches is also looming larger and organizations are constantly looking to safeguard their presence. Along with… Continue reading Consolidating Azure SQL Vulnerability Scan Reports Across Databases

As businesses today increasingly advance their online presence, the associated threat of security breaches also looms larger, and organizations are constantly looking to safeguard themselves. Among these, threats to databases have always been a key focus for ensuring the integrity of data, and in today’s world the target could very well be an Azure SQL DB. Fortunately, Azure SQL Server offers a built-in solution named the Vulnerability Assessment tool. Here is a link to Microsoft’s official page showcasing how it can be done using SSMS. You can also go to the Azure SQL Server blade in the Azure portal and open Advanced Data Security to enable and schedule the vulnerability scans and send an email to business users.

Now, the scan reports are generated on a per-database basis and stored in blob storage. For an organization, data may lie across multiple databases, and one needs a holistic view of the vulnerabilities across all of them.

This is the premise of my discourse here, and I will demonstrate a quick way to consolidate multiple such scan reports into a useful Power BI dashboard.

So, let’s get down to business. Next, we need to understand the scheme that we will use to get this dashboard in place.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

As you can see, there are four parts to the overall task.

  1. Create the Vulnerability Assessment scans
  2. Extract the results of these scans
  3. Put the extracted results of the scans in a reporting database
  4. Create the reports in Power BI

Create the Vulnerability Assessment scans

As mentioned, we can create the scheduled scans either from the SQL Server blade or from the database blade. For example, you can go to the Advanced Data Security tab and open the Vulnerability Assessment page.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Here you can see earlier Vulnerability Assessment results, or you can click Scan to generate a fresh scan report.
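
If you prefer to trigger scans programmatically rather than from the portal, the Az.Sql module exposes cmdlets for the same operations. A minimal sketch, assuming Advanced Data Security and vulnerability assessment are already configured with a storage account (the resource names are placeholders):

# Trigger an on-demand vulnerability assessment scan (placeholder names)
$rgName     = "rg-data"
$serverName = "sql-netwoven"
$dbName     = "NetwovenDatabase"

Start-AzSqlDatabaseVulnerabilityAssessmentScan `
    -ResourceGroupName $rgName -ServerName $serverName -DatabaseName $dbName `
    -ScanId ("scan-{0:yyyyMMddHHmmss}" -f (Get-Date))

# List the scan records generated so far for this database
Get-AzSqlDatabaseVulnerabilityAssessmentScanRecord `
    -ResourceGroupName $rgName -ServerName $serverName -DatabaseName $dbName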

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Extract the results of these scans

Okay, so far we have the scan report. To merge scan results obtained from different databases, we need to pull information from blob storage. This activity can be divided into three sections:

  • Get the connection
  • Get the Location of the latest vulnerability scan of the database
  • Get the results of the scan of the database

Get the connection

The scan results are stored in the storage account associated with Advanced Data Security on the Azure SQL server. Let us explore the storage account to find our scan records.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

We understand that the data is stored inside a container in blob storage. We will need to find the latest scan and then extract the results from the JSON document. We will do this using a runbook in an Azure Automation account. So let’s create an Automation account if one is not already there and import the Az.Accounts, Az.Sql, and Az.Storage modules, along with any dependent modules we will need in our Azure PowerShell runbook. The Az.Storage module is needed to interact with blob storage in Azure.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

As shown in the figure above, import the modules from the Modules gallery in the Azure Automation account. Note that Azure Automation accounts are independent of one another; importing a module in one does not make it available in the others. The modules take some time to appear after they have been imported.

Let us create our runbook AzureSQLDBVulnerabilityReporting as a PowerShell runbook, as shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Now let us first create the connection in the Azure Runbook as shown below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases
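
A minimal sketch of this connection step follows, using the classic Run As connection; newer Automation accounts can use a managed identity instead:

# Authenticate the runbook against Azure using the Run As connection
$conn = Get-AutomationConnection -Name "AzureRunAsConnection"
Connect-AzAccount -ServicePrincipal `
    -Tenant $conn.TenantId `
    -ApplicationId $conn.ApplicationId `
    -CertificateThumbprint $conn.CertificateThumbprint | Out-Null

# Alternative: a system-assigned managed identity, if enabled on the Automation account
# Connect-AzAccount -Identity | Out-Null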

Get the Location of the latest vulnerability scan of the database

The cmdlet Get-AzSqlDatabaseVulnerabilityAssessmentScanRecord helps us get all the scan IDs of the database.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

We then sort the results by the scan EndTime property and choose the last one to get the latest scan, as shown in the figure above. It took some effort to identify the properties of these Azure objects, but now it should appear quite easy.
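
A minimal sketch of this step (resource names are placeholders, and the property holding the exported results path should be confirmed by inspecting the returned object):

# Get the latest vulnerability assessment scan record for the database
$latestScan = Get-AzSqlDatabaseVulnerabilityAssessmentScanRecord `
    -ResourceGroupName $rgName -ServerName $serverName -DatabaseName $dbName |
    Sort-Object -Property EndTime |
    Select-Object -Last 1

# Blob path of the exported results (property name is an assumption; verify with $latestScan | Format-List)
$blobPath = $latestScan.ScanResultsLocationPath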

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Now let us run the runbook and ensure we only get one row for the database scan. We get the expected results as is shown in the Figure above.

Get the results of the scan of the database

The Get-AzStorageBlobContent cmdlet will download the file stored in Azure Blob storage. We need to pass the required parameters, which include the container (the main storage container), the blob (the path of the blob inside the container), and the context (built from the storage account name and storage account key), as shown in the figure.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Now, it is not a good idea to hard-code the storage key in plain text inside the code, so we will use the Get-AzStorageAccountKey cmdlet to retrieve the key of the storage account.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

So we have all the parameters needed for New-AzStorageContext and Get-AzStorageBlobContent. Once we get the data from blob storage, we identify the path to which it was downloaded and read the file with Get-Content. The data is in JSON format, so we parse it using ConvertFrom-Json.
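
Putting these cmdlets together, a minimal sketch of the download-and-parse step (the storage account, container, and blob path are placeholders; in the runbook the blob path would come from the scan record retrieved earlier):

# Build a storage context without hard-coding the key
$storageKey = (Get-AzStorageAccountKey -ResourceGroupName $rgName -Name $storageAccountName)[0].Value
$ctx = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageKey

# Download the scan export and parse the JSON
Get-AzStorageBlobContent -Container $containerName -Blob $blobPath `
    -Context $ctx -Destination $env:TEMP -Force | Out-Null

$localFile = Join-Path $env:TEMP (Split-Path $blobPath -Leaf)
$scanJson  = Get-Content -Path $localFile -Raw | ConvertFrom-Json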

To extract the data from the JSON, we consider two items: the status field and the severity of the rule. The JSON contains a number of small documents, and we need to browse the results document for entries where the status field is ‘Finding’. To get the severity of each rule, we join with the rules document. Once we have these two values, we can either keep them in a variable or store them in a table in the reporting SQL database.
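
A minimal sketch of that extraction, reusing $scanJson from the previous step. The property names (results, rules, ruleId, status, severity) are assumptions based on the description above and should be confirmed against an actual scan export:

# Build a lookup of rule severities (property names are assumptions)
$ruleSeverity = @{}
foreach ($rule in $scanJson.rules) { $ruleSeverity[$rule.ruleId] = $rule.severity }

# Keep only the failed checks and attach the severity of each rule
$findings = $scanJson.results |
    Where-Object { $_.status -eq 'Finding' } |
    ForEach-Object {
        [pscustomobject]@{
            DatabaseName = $dbName
            RuleId       = $_.ruleId
            Severity     = $ruleSeverity[$_.ruleId]
            ScanEndTime  = $latestScan.EndTime
        }
    }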

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

As we can see in the figure above, the data is consistent with the scan results obtained earlier.

Put the extracted results of the scans in a reporting database

Now we can store this result in a table db_vulner in a reporting database as is shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases
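
To consolidate across databases, the same steps can be repeated for every database on the server and the findings written into the reporting table. A minimal sketch, assuming the SqlServer module is imported into the Automation account and that db_vulner has matching columns (both are assumptions; adjust to your actual table definition):

# Loop over every user database on the server and persist its findings
foreach ($db in Get-AzSqlDatabase -ResourceGroupName $rgName -ServerName $serverName |
                Where-Object { $_.DatabaseName -ne 'master' }) {

    $dbName = $db.DatabaseName
    # ... repeat the scan-record lookup, blob download, and findings extraction for $dbName ...

    foreach ($f in $findings) {
        $insert = "INSERT INTO dbo.db_vulner (DatabaseName, RuleId, Severity, ScanEndTime)
                   VALUES (N'$($f.DatabaseName)', N'$($f.RuleId)', N'$($f.Severity)', '$($f.ScanEndTime)');"

        Invoke-Sqlcmd -ServerInstance "$reportServerName.database.windows.net" `
            -Database $reportDbName -Username $sqlUser -Password $sqlPassword -Query $insert
    }
}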

After executing the RunBook we can confirm the data in SQL Report Database as is shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Create the reports in Power BI

Now we come to the final part of our activity where we show how to present the Vulnerability Scan Results that we have captured.

Open Power BI Desktop, click Get Data, search for Azure SQL Database, and click Connect, as shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

After entering the correct information for Server and Database, choose the right authentication method, provide the credentials, and click Connect, as shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Choose the right tables and click on Load.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Once you are on Page 1, create a new measure for the number of vulnerabilities, as shown in the figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

Once you are done, choose a donut or your favorite chart to showcase the data, as shown in the final figure below.

Consolidating Azure SQL Vulnerability Scan Reports Across Databases

The final result is one comprehensive view of the vulnerability results for all the databases. Once you are able to fetch the results from multiple databases, you are free to create your own report in whatever way you need. The dashboard becomes your one-stop monitoring point for all the scan results.

Cool, isn’t it?

Please feel free to comment or write back, I will be happy to answer any questions you might have.

How to Change Linked Server Name in Stored Procedures https://netwoven.com/data-and-analytics/change-linked-server-name-stored-procedures/ https://netwoven.com/data-and-analytics/change-linked-server-name-stored-procedures/#respond Tue, 06 Feb 2018 21:50:30 +0000 https://www.netwoven.com/?p=25666 When we restore on-premise SQL database from one environment to another environment, we may need to change linked server names which exist in the inside of source database’s stored procedures… Continue reading How to Change Linked Server Name in Stored Procedures

When we restore an on-premises SQL database from one environment to another, we may need to change the linked server names referenced inside the source database’s stored procedures to the new linked server name in the restored database. For example, if the source database “NetwovenDatabase” on an on-premises SQL server uses the linked server “LinkedServer1”, then after restoring “NetwovenDatabase” to another on-premises SQL server, we need to change the associated linked server name from “LinkedServer1” to “LinkedServer2”.

We could change the linked server name from “LinkedServer1” to “LinkedServer2” manually by opening the procedures one by one, but this is not efficient. Instead, we can execute the following script after restoring the database from one environment to another.
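
Before running the replacement, you may first want to see which procedures actually reference the old linked server. A minimal PowerShell sketch (server and database names are placeholders; assumes the SqlServer module is installed):

# List stored procedures whose definition references the old linked server name
$query = @"
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS SchemaName,
       OBJECT_NAME(m.object_id)        AS ProcedureName
FROM   sys.sql_modules AS m
       JOIN sys.objects AS o ON o.object_id = m.object_id
WHERE  o.type = 'P'
  AND  m.definition LIKE '%LinkedServer1%';
"@

Invoke-Sqlcmd -ServerInstance "RestoredSqlServer" -Database "NetwovenDatabase" -Query $query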

Steps:

1. Connect to the restored database server using SSMS with administrative privileges.

2. Connect to the master database.

3. Script:

Use <restored database name>
SET NOCOUNT ON

DECLARE @searchFor VARCHAR(100), @replaceWith VARCHAR(100)
DECLARE @count INT
DECLARE @i INT = 1
DECLARE @SPName VARCHAR(1000)
DECLARE @moddef NVARCHAR(MAX)

-- Table variable for storing the names of affected stored procedures
DECLARE @TStoredProcedures TABLE
(
    SNo INT IDENTITY(1,1),
    SPName VARCHAR(MAX)
)

-- Text to search for: the old linked server name
-- (the leading space helps avoid matching it as part of a longer identifier)
SET @searchFor = ' LinkedServer1'
-- Text to replace with: the new linked server name
SET @replaceWith = ' LinkedServer2'

-- Collect every stored procedure whose definition references the old linked server
-- but does not already reference the new one
INSERT INTO @TStoredProcedures (SPName)
SELECT DISTINCT OBJECT_NAME(c.id) AS SPName
FROM syscomments c
     INNER JOIN sysobjects o ON c.id = o.id
WHERE c.text LIKE '%' + @searchFor + '%'
  AND c.text NOT LIKE '%' + @replaceWith + '%'
  AND o.type = 'P'

SELECT @count = COUNT(SNo) FROM @TStoredProcedures

WHILE (@i <= @count)
BEGIN
    SELECT @SPName = SPName FROM @TStoredProcedures WHERE SNo = @i

    -- Build an ALTER PROCEDURE statement with the linked server name replaced
    SET @moddef =
    (
        SELECT REPLACE(REPLACE(a.definition, @searchFor, @replaceWith), 'create ', 'ALTER ')
        FROM sys.sql_modules a
             INNER JOIN
             (
                 SELECT type, name, object_id
                 FROM sys.objects
                 WHERE type = 'P'        -- stored procedures only
                   AND is_ms_shipped = 0
             ) b ON a.object_id = b.object_id
        WHERE b.name = @SPName
    )

    EXECUTE sp_executesql @moddef

    SELECT @i = @i + 1
END

Conclusion:

Using this script, you can change the linked server name referenced inside the restored database’s stored procedures. I hope you find it useful.
