- Data Engineering - The process of designing, building, and maintaining systems for the collection, storage, and analysis of data.
- ETL (Extract, Transform, Load) - Process of extracting data from various sources, transforming it into a usable format, and loading it into a target destination (a minimal sketch appears after this list).
- Data Pipeline - Automated process for moving data from one system to another.
- Data Lake - Centralized repository for storing structured and unstructured data at any scale.
- Data Warehouse - Centralized repository for structured data used for reporting and analysis.
- Data Mart - Subset of a data warehouse focused on a specific business area or department.
- OLTP (Online Transaction Processing) - System for managing transaction-oriented applications.
- OLAP (Online Analytical Processing) - System for analyzing and querying multidimensional data.
- Batch Processing - Processing data in accumulated groups (batches) on a schedule, rather than as it arrives.
- Real-Time Processing - Processing data immediately as it arrives.
- Streaming Data - Continuous flow of data generated from various sources.
- Data Ingestion - Process of importing data from external sources into a storage system.
- Data Wrangling - Process of cleaning, structuring, and enriching raw data to prepare it for analysis and modeling.
- Data Governance - Framework for managing data assets and ensuring data quality, security, and compliance.
- Data Quality - Measure of the accuracy, completeness, and reliability of data.
- Data Profiling - Analyzing data to understand its structure, content, and quality, including completeness, accuracy, and consistency.
- Data Catalog - Centralized inventory of data assets and metadata.
- Data Lineage - Record of the origin and movement of data through a system.
- Data Masking - Technique for obfuscating sensitive data so it can be used for testing or analysis without exposing the original values (see the masking sketch after this list).
- Data Anonymization - Process of removing or transforming personally identifiable information so that individuals cannot be re-identified.
- Data Encryption - Method of encoding data to prevent unauthorized access.
- Data Compression - Technique for reducing the size of data to save storage space and bandwidth.
- Data Archiving - Moving data to long-term storage for retention purposes.
- Data Modeling - Process of designing the structure and relationships of data in a database.
- Relational Database - Database structured around tables and relationships between data.
- NoSQL Database - Database designed for storing and retrieving unstructured or semi-structured data.
- Document Database - NoSQL database that stores data in JSON or BSON documents.
- Key-Value Store - NoSQL database that stores data as a collection of key-value pairs.
- Column-Family Store - NoSQL database that groups related columns into column families within each row, allowing wide, sparsely populated rows (e.g., Cassandra, HBase).
- Graph Database - NoSQL database optimized for storing and querying graph data.
- Time-Series Database - Database optimized for storing and querying time-series data.
- Data Partitioning - Dividing data into smaller subsets to improve performance and scalability.
- Data Replication - Copying data to multiple locations for redundancy and fault tolerance.
- Data Sharding - Distributing data across multiple servers or nodes, typically by hashing or ranging on a shard key, to improve performance and scalability (see the sharding sketch after this list).
- Data Consistency - Ensuring that data remains accurate and up-to-date across multiple sources.
- CAP Theorem - Theorem stating that a distributed data store can provide at most two of the following three guarantees at the same time: consistency, availability, and partition tolerance.
- ACID (Atomicity, Consistency, Isolation, Durability) - Properties of database transactions.
- BASE (Basically Available, Soft state, Eventually consistent) - Alternative to ACID for distributed systems.
- Data Warehouse Architecture - Design and structure of a data warehouse system.
- Star Schema - Data warehouse schema consisting of a central fact table joined to multiple denormalized dimension tables (see the schema sketch after this list).
- Snowflake Schema - Variation of star schema where dimension tables are normalized.
- Fact Table - Table in a star schema that contains metrics or measurements.
- Dimension Table - Table in a star schema that contains descriptive attributes.
- Surrogate Key - Artificial primary key used to uniquely identify records in a table.
- Slowly Changing Dimension (SCD) - Dimension whose attribute values change slowly and unpredictably over time; the SCD type (e.g., Type 1 overwrite, Type 2 row versioning) determines how that history is stored (a Type 2 sketch appears after this list).
- ETL Tool - Software for designing, building, and managing ETL processes.
- Data Integration - Combining data from different sources into a unified view.
- Master Data Management (MDM) - Process of managing and ensuring the quality of critical data across an organization.
- Data Governance Council - Group responsible for establishing and enforcing data governance policies and procedures.
- Data Steward - Individual responsible for managing and maintaining data assets.
- Data Dictionary - Repository of data definitions and metadata.
- Data Lake Architecture - Design and structure of a data lake system.
- Lambda Architecture - Hybrid architecture that combines a batch layer and a real-time (speed) layer to provide both accurate and timely views of the data.
- Kappa Architecture - Simplified alternative to the Lambda architecture that processes all data, including historical reprocessing, through a single stream-processing path.
- Data Mesh - Architecture paradigm that decentralizes data ownership to business domains, treats data as a product, and emphasizes domain-driven design.
- Data Pipeline Framework - Software framework for building and managing data pipelines.
- DAG (Directed Acyclic Graph) - Graph with directed edges and no cycles; in a data pipeline, nodes represent tasks and edges represent dependencies, so tasks can be executed in dependency order (see the orchestration sketch after this list).
- Workflow Orchestration - Coordination and automation of tasks in a data pipeline.
- Data Versioning - Managing different versions of data to track changes over time.
- Data Lake Governance - Policies and processes for managing data lakes and ensuring data quality and security.
- Data Lake Security - Measures to protect data lakes from unauthorized access, misuse, and breaches.
- Data Warehouse Optimization - Techniques for improving the performance and efficiency of data warehouses.
- Data Warehouse Automation - Process of automating the design, development, and maintenance of data warehouses.
- Data Vault Modeling - Data warehouse modeling methodology that emphasizes flexibility, scalability, and auditability of historical data.
- Data Mesh Governance - Policies and processes for governing a decentralized data mesh, ensuring consistency, compliance, and collaboration across domains.
- Data Mesh Security - Security measures for protecting decentralized data mesh architectures.
- Data Engineering Team - Group of professionals responsible for designing, building, and maintaining data infrastructure and pipelines.
- Data Engineering Manager - Leader responsible for overseeing data engineering projects and teams.
- Data Engineer - Professional responsible for designing, building, and maintaining data pipelines and infrastructure.
- Data Architect - Professional responsible for designing and optimizing data architectures and systems.
- Data Analyst - Professional responsible for analyzing and interpreting data to inform business decisions.
- Data Scientist - Professional responsible for analyzing complex datasets and deriving insights using statistical and machine learning techniques.
- Big Data - Term used to describe large and complex datasets that cannot be processed using traditional data processing techniques.
- Data Lakehouse - Hybrid architecture that combines the low-cost, flexible storage of a data lake with the management and query capabilities of a data warehouse for structured and unstructured data.
- Data Engineering Framework - Methodology or approach for designing and building data engineering solutions.
- Polyglot Persistence - Strategy of using multiple data storage technologies to handle different types of data within the same application.
- Data Fabric - Unified architecture that enables seamless access to data across distributed environments and heterogeneous data sources.
- DataOps - Agile methodology for managing the entire data lifecycle, including development, deployment, and operations.
- ModelOps - DevOps-like approach to managing machine learning models throughout their lifecycle, from development to deployment and monitoring.
- Feature Store - Centralized repository for storing, managing, and sharing machine learning features for model training and deployment.
- Streaming Data Processing - Real-time analysis of continuous streams of data for immediate insights and action.
- Complex Event Processing (CEP) - Technique for analyzing and correlating events from multiple sources to identify patterns or anomalies in real-time.
- Data Ingestion Framework - Set of tools and processes for collecting, transforming, and loading data from various sources into a data storage system.
- Data Governance Framework - Structured approach to managing data assets, ensuring data quality, and enforcing data policies and regulations.
- Data Stewardship - Responsibility for managing and maintaining the quality, integrity, and security of data within an organization.
- Data Lineage Analysis - Examination of the origins, transformations, and movements of data throughout its lifecycle to ensure accuracy and compliance.
- Data Mesh Architecture - Distributed architecture that decentralizes data ownership and processing responsibilities while providing standardized access and governance.
- Real-Time Analytics - Analysis of data as it is generated to derive immediate insights and make timely decisions.
- Near Real-Time Analytics - Analysis of data with minimal delay, typically within seconds or minutes of its generation.
- Data Orchestration - Coordination of data workflows and processes to ensure data is collected, processed, and delivered efficiently and reliably.
- Data Virtualization - Technique for abstracting and combining data from multiple sources to provide a unified view without physically moving or copying the data.
- Data Cataloging - Automated process of indexing, organizing, and documenting metadata and data assets for easy discovery and analysis.
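The short Python sketches below illustrate several of the terms above; all file, table, column, and task names are illustrative assumptions rather than references to any specific tool.

A minimal ETL sketch, referenced from the ETL entry: it extracts rows from a CSV file, transforms them, and loads them into a SQLite table.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source (path is an assumed example file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: normalize fields and coerce types.
    for row in rows:
        yield {
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": round(float(row["amount"]), 2),
        }

def load(rows, db_path="warehouse.db"):
    # Load: write the transformed rows into a target table.
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :country, :amount)", rows
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```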
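A sketch of the one-way masking mentioned in the Data Masking and Data Anonymization entries: sensitive values are replaced with salted hashes so records remain joinable without exposing the original value. The static salt is a simplification for illustration; real systems keep salts or keys in a secret store.

```python
import hashlib

SALT = b"example-static-salt"  # illustrative only; manage real salts in a secret store

def mask_value(value: str) -> str:
    # Replace a sensitive value with a deterministic, non-reversible token.
    digest = hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()
    return digest[:16]

record = {"customer_id": 42, "email": "jane@example.com"}
masked = {**record, "email": mask_value(record["email"])}
print(masked)  # same email always maps to the same token, so joins still work
```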
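A sketch of hash-based sharding, referenced from the Data Partitioning and Data Sharding entries: the shard key is hashed and mapped onto one of N shards, so the same key always lands on the same node. The shard count and keys are assumptions.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size

def shard_for(key: str) -> int:
    # Hash the shard key and map it onto one of NUM_SHARDS shards.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for customer_id in ["c-001", "c-002", "c-003"]:
    print(customer_id, "-> shard", shard_for(customer_id))
```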
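A compact star schema sketch, referenced from the Star Schema, Fact Table, and Dimension Table entries, using SQLite from Python. The table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- Fact table: numeric measures plus foreign keys to the dimension tables.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")

# A typical analytical query joins the fact table to its dimensions.
query = """
SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, p.category
"""
print(con.execute(query).fetchall())  # empty until the tables are loaded
```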
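A sketch of a Type 2 slowly changing dimension update, referenced from the Slowly Changing Dimension entry: when a tracked attribute changes, the current row is closed out and a new row with a new surrogate key and effective dates is appended, preserving history. Field names are assumptions.

```python
from datetime import date

# Existing dimension rows: surrogate key, business key, tracked attribute, validity window.
dim_customer = [
    {"sk": 1, "customer_id": "C-100", "city": "Lisbon",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    # Close the current row and append a new version (SCD Type 2).
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change, nothing to do
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({
        "sk": max(r["sk"] for r in dim) + 1,  # new surrogate key
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })

apply_scd2(dim_customer, "C-100", "Porto", date(2024, 6, 1))
for row in dim_customer:
    print(row)
```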
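A toy workflow-orchestration sketch, referenced from the DAG and Workflow Orchestration entries: pipeline tasks and their dependencies form a directed acyclic graph, and the standard-library TopologicalSorter yields an execution order that respects those dependencies. The task names are assumptions; real orchestrators add scheduling, retries, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on (the edges of the DAG).
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
    "refresh_dashboard": {"load_warehouse"},
}

def run(task_name):
    # Placeholder for real work (queries, Spark jobs, API calls, ...).
    print("running", task_name)

# static_order() returns the tasks in a dependency-respecting order.
for task in TopologicalSorter(dag).static_order():
    run(task)
```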