Databricks Performance Optimization Course


24 hours
Overview

This course is designed to equip professionals to identify, analyze, and resolve performance bottlenecks in Databricks environments. Participants learn advanced techniques for improving the performance of data engineering, analytics, machine learning, and large-scale data processing workloads using Apache Spark, Delta Lake, and the Lakehouse architecture.

The training covers everything from the fundamentals of Spark's distributed execution to advanced strategies for query optimization, file management, partitioning, caching, cluster tuning, the Photon engine, and monitoring of production workloads.

Objectives

After completing this Databricks Performance Optimization course, you will be able to:

  • Understand the internals of Apache Spark
  • Interpret execution plans and performance metrics
  • Identify processing bottlenecks
  • Optimize SQL queries and DataFrames
  • Apply advanced partitioning techniques
  • Use Delta Lake optimization features correctly
  • Configure clusters for maximum efficiency
  • Reduce the operating costs of workloads
  • Improve batch and streaming pipelines
  • Implement performance best practices in production environments
Target Audience
  • Data Engineers
  • Data Architects
  • Analytics Engineers
  • Spark Developers
  • Data Scientists
  • Cloud Engineers
  • Databricks Platform Administrators
  • Professionals who want to optimize Databricks environments in production
Prerequisites
  • Knowledge of Apache Spark
  • Basic knowledge of Databricks
  • Experience with SQL
  • Knowledge of Delta Lake
  • Data Lake and Data Warehouse concepts
  • Familiarity with cloud environments
Materials
English/Portuguese + Exercises + Hands-on Lab
Course Content

Module 1: Introduction to Databricks Performance Optimization

  1. Performance Optimization Overview
  2. Databricks Lakehouse Architecture
  3. Apache Spark Execution Model
  4. Distributed Computing Concepts
  5. Performance Bottlenecks
  6. Optimization Methodology
  7. Cost versus Performance

Module 2: Understanding Spark Internals

  1. Spark Architecture
  2. Driver and Executors
  3. DAG Execution
  4. Stages and Tasks
  5. Shuffle Operations
  6. Memory Management
  7. Execution Lifecycle

Module 3: Query Execution Analysis

  1. Query Planning Process
  2. Catalyst Optimizer
  3. Physical Execution Plans
  4. Explain Commands
  5. Adaptive Query Execution
  6. Query Metrics Analysis
  7. Troubleshooting Slow Queries

Module 4: SQL Performance Tuning

  1. Efficient SQL Design
  2. Predicate Pushdown
  3. Projection Pruning
  4. Join Optimization
  5. Aggregation Optimization
  6. Window Function Performance
  7. SQL Best Practices

Module 5: DataFrame Performance Optimization

  1. Efficient DataFrame Operations
  2. Transformation Optimization
  3. Avoiding Expensive Operations
  4. Lazy Evaluation
  5. Caching Strategies
  6. Persist Techniques
  7. Code Optimization Patterns

Module 6: Delta Lake Optimization

  1. Delta Lake Architecture
  2. File Layout Optimization
  3. OPTIMIZE Command
  4. ZORDER Optimization
  5. VACUUM Operations
  6. Data Skipping
  7. Delta Best Practices

Module 7: Data Layout and Partitioning Strategies

  1. Partitioning Fundamentals
  2. Partition Design
  3. Over-Partitioning Issues
  4. Under-Partitioning Issues
  5. Bucketing Concepts
  6. File Size Optimization
  7. Storage Performance Tuning

Module 8: Cluster Performance Optimization

  1. Cluster Architecture
  2. Cluster Sizing
  3. Autoscaling Configuration
  4. Worker Node Selection
  5. Driver Optimization
  6. Resource Utilization Monitoring
  7. Cost Optimization Techniques

Module 9: Photon Engine Optimization

  1. Photon Architecture
  2. Vectorized Execution
  3. Workload Compatibility
  4. Query Acceleration
  5. Performance Benchmarks
  6. Monitoring Photon Usage
  7. Best Practices

Module 10: Streaming Performance Optimization

  1. Structured Streaming Internals
  2. Trigger Configuration
  3. Checkpoint Optimization
  4. State Management
  5. Watermarking Optimization
  6. Streaming Metrics Analysis
  7. Troubleshooting Streaming Workloads

Module 11: Monitoring and Observability

  1. Spark UI Analysis
  2. Databricks Metrics
  3. Ganglia Monitoring
  4. Job Performance Analysis
  5. Cluster Metrics
  6. Event Logs
  7. Root Cause Analysis

Module 12: Production Performance Best Practices

  1. End-to-End Optimization Strategy
  2. Workload Segmentation
  3. Resource Governance
  4. CI/CD Performance Validation
  5. Capacity Planning
  6. Performance Testing
  7. Operational Excellence

Hands-on Labs

Lab 1: Spark Execution Analysis

  1. Analyze DAG Execution
  2. Review Stages and Tasks
  3. Identify Bottlenecks
  4. Interpret Spark UI Metrics

Lab 2: SQL Query Optimization

  1. Analyze Query Plans
  2. Optimize Slow Queries
  3. Compare Execution Times
  4. Validate Improvements

Lab 3: DataFrame Optimization

  1. Refactor Inefficient Transformations
  2. Implement Caching
  3. Optimize Joins
  4. Reduce Shuffle Operations

Lab 4: Delta Lake Optimization

  1. Execute OPTIMIZE Operations
  2. Implement ZORDER
  3. Analyze File Distribution
  4. Improve Query Performance

Lab 5: Partitioning Strategies

  1. Create Partitioned Tables
  2. Test Different Partition Designs
  3. Analyze Data Skipping
  4. Compare Query Performance

Lab 6: Cluster Optimization

  1. Resize Clusters
  2. Configure Autoscaling
  3. Analyze Resource Usage
  4. Optimize Compute Costs

Lab 7: Photon Performance Benchmark

  1. Enable Photon Engine
  2. Compare Execution Results
  3. Analyze Query Acceleration
  4. Measure Cost Savings

Lab 8: Streaming Optimization

  1. Tune Streaming Pipelines
  2. Optimize State Management
  3. Configure Watermarks
  4. Monitor Streaming Performance

Lab 9: Production Troubleshooting Workshop

  1. Diagnose Real Performance Issues
  2. Analyze Logs and Metrics
  3. Apply Optimization Techniques
  4. Validate Improvements

Lab 10: End-to-End Optimization Project

  1. Analyze Existing Data Platform
  2. Identify Performance Problems
  3. Optimize Data Layout
  4. Tune SQL Workloads
  5. Optimize Data Pipelines
  6. Improve Cluster Configuration
  7. Implement Monitoring
  8. Produce Performance Assessment Report

Final Project

A complete performance assessment of a Databricks environment, including workload analysis, SQL query optimization, Delta Lake tuning, cluster configuration, operating cost reduction, and implementation of continuous monitoring for large-scale enterprise environments.

