
Building an Analytical Layer for Vulnerability Management

How I built a BigQuery data pipeline for our security team's vulnerability management system — transforming operational data into self-service analytics using medallion architecture

BigQuery · Data Engineering · Security · Medallion Architecture · AI Collaboration

The Challenge

Our security team at MercadoLibre had an existing application to monitor vulnerabilities across our assets. The system integrated data from multiple security vendors and automated ticket creation. It worked well for operations — but the data wasn’t accessible for analysis.

My role: Build the data pipeline to make this information available for self-service analytics.

Scope:

  • Transform raw security scan data into an analytical warehouse
  • Apply medallion architecture principles
  • Enable self-service analytics for security and compliance teams
  • Establish data governance with enterprise naming conventions

This case study demonstrates how AI collaboration accelerated a real internal data engineering project.


Architecture: Medallion Pattern

Bronze Layer (Raw Ingestion)

9 STG tables → Original string-based data types
Direct ingestion from vulnerability scanners
Sample data files for testing

Purpose: Preserve raw data exactly as received from security scanning tools.
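
For context, a bronze-layer table is as plain as it sounds: every column lands as a STRING, exactly as the scanner export delivers it. A minimal sketch using the asset columns referenced later in this post (the real table has more fields):

CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.STG_SECURITY_ASSETS` (
    ASSET_UID         STRING,  -- asset identifier, still text at this stage
    ASSET_OS          STRING,
    ASSET_IP          STRING,
    ASSET_HOST_START  STRING,  -- dates arrive as 'YYYY-MM-DD HH:MM:SS' strings
    ASSET_HOST_END    STRING,
    ASSET_LICENSED    STRING   -- 'true'/'1'/'yes'/... gets typed in the silver layer
);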

Silver Layer (Cleaned & Typed)

5 Lookup Tables (LK_*) → Dimensions and reference data
4 Fact Tables (BT_*) → Core vulnerability instances and scan history

Key transformations:

  • String → INTEGER (SAFE_CAST for error handling)
  • String → DATETIME (PARSE_DATETIME with validation)
  • String → BOOL ('true'/'1'/'yes'/'y' → TRUE, 'false'/'0'/'no'/'n' → FALSE)
  • NULLIF patterns to handle empty strings
  • Enterprise naming conventions (AST_, VUL_, SCN_ prefixes)
  • Audit fields (AUD_INS_DTTM, AUD_UPD_DTTM)
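
The sample transformation below sets both audit timestamps at insert time. If the silver tables are refreshed incrementally rather than reloaded, a MERGE keeps the two fields meaningful. This is a sketch rather than the production logic, reusing names from the asset example:

MERGE `PROJECT.DATASET.LK_SECURITY_ASSETS` tgt
USING (
    SELECT
        SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,
        NULLIF(TRIM(ASSET_OS), '')    AS AST_OS_DESC
    FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
    WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
) src
ON tgt.AST_ASSET_ID = src.AST_ASSET_ID
WHEN MATCHED THEN UPDATE SET
    AST_OS_DESC  = src.AST_OS_DESC,
    AUD_UPD_DTTM = CURRENT_DATETIME()   -- only the update timestamp moves
WHEN NOT MATCHED THEN
    INSERT (AST_ASSET_ID, AST_OS_DESC, AUD_INS_DTTM, AUD_UPD_DTTM)
    VALUES (src.AST_ASSET_ID, src.AST_OS_DESC, CURRENT_DATETIME(), CURRENT_DATETIME());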

Gold Layer (Analytics Ready)

Dashboard tables, aggregated metrics, KPIs
Executive-level visualizations
Risk trend analysis
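
The gold tables themselves were specific to our dashboards, but the shape is what you would expect: daily rollups of the silver facts. A representative sketch (the GLD_ prefix, fact-table name, and status value are illustrative, not the actual objects):

CREATE OR REPLACE TABLE `PROJECT.DATASET.GLD_VULNERABILITY_DAILY_RISK` AS
SELECT
    DATE(VUL_FIRST_FOUND_DATE)   AS RISK_DATE,
    VUL_RISK_LEVEL,
    COUNT(*)                     AS OPEN_VULNERABILITY_COUNT,
    COUNT(DISTINCT VUL_ASSET_ID) AS AFFECTED_ASSET_COUNT
FROM `PROJECT.DATASET.BT_VULNERABILITY_DETECTIONS`  -- illustrative BT_* fact table
WHERE VUL_STATE_STATUS != 'FIXED'                   -- illustrative status filter
GROUP BY RISK_DATE, VUL_RISK_LEVEL;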

Sample Code: Complex Transformation

Here’s a real example from the LK_SECURITY_ASSETS table transformation:

INSERT INTO `PROJECT.DATASET.LK_SECURITY_ASSETS` (
    AST_ASSET_ID,
    AST_OS_DESC,
    AST_IP_ADDRESS,
    AST_HOST_START_DATE,
    AST_HOST_END_DATE,
    AST_LICENSED_FLAG,
    AUD_INS_DTTM,
    AUD_UPD_DTTM
)
SELECT
    -- Primary key with error handling
    SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,

    -- Data cleansing
    NULLIF(TRIM(ASSET_OS), '') AS AST_OS_DESC,
    NULLIF(TRIM(ASSET_IP), '') AS AST_IP_ADDRESS,

    -- Complex datetime parsing
    CASE
        WHEN ASSET_HOST_START IS NOT NULL AND ASSET_HOST_START != ''
        THEN PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START)
        ELSE NULL
    END AS AST_HOST_START_DATE,

    CASE
        WHEN ASSET_HOST_END IS NOT NULL AND ASSET_HOST_END != ''
        THEN PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_END)
        ELSE NULL
    END AS AST_HOST_END_DATE,

    -- Sophisticated boolean conversion
    CASE
        WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('true', '1', 'yes', 'y') THEN TRUE
        WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('false', '0', 'no', 'n') THEN FALSE
        ELSE NULL
    END AS AST_LICENSED_FLAG,

    -- Audit trail
    CURRENT_DATETIME() AS AUD_INS_DTTM,
    CURRENT_DATETIME() AS AUD_UPD_DTTM

FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
  -- Business logic: Filter active assets from 2023+
  AND (
    ASSET_HOST_START IS NULL OR ASSET_HOST_START = ''
    OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START) >= DATETIME('2023-01-01')
  );

Why this matters:

  • Error handling: SAFE_CAST prevents failures on invalid data
  • Data quality: NULLIF converts empty strings to NULL for cleaner analytics
  • Flexibility: Boolean logic handles multiple string formats
  • Auditability: Timestamps track data lineage
  • Performance: Proper typing enables BigQuery optimizations

Enterprise Naming Conventions

Following international data governance standards:

Convention       | Pattern                     | Example                         | Purpose
Table Prefixes   | LK_ / BT_                   | LK_SECURITY_ASSETS              | Distinguish lookup vs fact tables
Field Suffixes   | _ID, _FLAG, _DESC, _DATE    | AST_ASSET_ID, VUL_EXPIRES_FLAG  | Indicate data type and purpose
Entity Prefixes  | AST_, VUL_, SCN_, PLG_      | AST_IP_ADDRESS, VUL_RISK_LEVEL  | Show field origin and relationships
Audit Fields     | AUD_INS_DTTM, AUD_UPD_DTTM  | AUD_INS_DTTM                    | Track data lineage and changes

Bilingual documentation: All table schemas included Spanish field descriptions for international teams, while maintaining English code and technical documentation.
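
In BigQuery this is just column-level descriptions in the DDL. An abridged sketch of how the Spanish descriptions looked (the description text here is illustrative):

CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.LK_SECURITY_ASSETS` (
    AST_ASSET_ID      INT64    OPTIONS(description = 'Identificador único del activo'),
    AST_OS_DESC       STRING   OPTIONS(description = 'Sistema operativo del activo'),
    AST_LICENSED_FLAG BOOL     OPTIONS(description = 'Indica si el activo está licenciado'),
    AUD_INS_DTTM      DATETIME OPTIONS(description = 'Fecha y hora de inserción del registro'),
    AUD_UPD_DTTM      DATETIME OPTIONS(description = 'Fecha y hora de última actualización')
);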


AI Collaboration in Practice

What AI accelerated:

  1. Schema design — Comprehensive JSON schemas with proper BigQuery types
  2. SQL generation — Transformation scripts with robust error handling
  3. Documentation — Architecture guides, naming conventions, field mappings
  4. Learning curve — Quickly ramped up on vulnerability management domain

What human expertise provided:

  1. Architecture decisions — Chose medallion pattern, partitioning strategies
  2. Business context — Understood how the security team would use the data
  3. Quality validation — Reviewed transformations, tested edge cases
  4. Integration knowledge — Knew how this fit into existing data infrastructure

The collaboration pattern:

Working with AI tools helped me move faster through the repetitive parts (schema definitions, transformation templates) while I focused on the decisions that required understanding our specific context and requirements.


BigQuery Optimizations

-- Partitioning for performance
PARTITION BY DATE(VUL_FIRST_FOUND_DATE)

-- Clustering for query optimization
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID

-- Data quality validation
WHERE (
  VULNERABILITY_FIRST_FOUND IS NULL
  OR VULNERABILITY_FIRST_FOUND = ''
  OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', VULNERABILITY_FIRST_FOUND)
     >= DATETIME('2023-01-01')
)
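
Assembled into one statement, those clauses sit on the fact-table creation. A sketch (the table and source column names beyond the ones shown above are illustrative):

CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.BT_VULNERABILITY_DETECTIONS`
PARTITION BY DATE(VUL_FIRST_FOUND_DATE)
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID
AS
SELECT
    SAFE_CAST(VULNERABILITY_ASSET_UID AS INT64) AS VUL_ASSET_ID,      -- illustrative source column
    NULLIF(TRIM(VULNERABILITY_RISK), '')        AS VUL_RISK_LEVEL,    -- illustrative source column
    NULLIF(TRIM(VULNERABILITY_STATE), '')       AS VUL_STATE_STATUS,  -- illustrative source column
    SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', VULNERABILITY_FIRST_FOUND) AS VUL_FIRST_FOUND_DATE,
    CURRENT_DATETIME() AS AUD_INS_DTTM,
    CURRENT_DATETIME() AS AUD_UPD_DTTM
FROM `PROJECT.DATASET.STG_VULNERABILITIES`  -- illustrative STG table name
WHERE (
    VULNERABILITY_FIRST_FOUND IS NULL
    OR VULNERABILITY_FIRST_FOUND = ''
    OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', VULNERABILITY_FIRST_FOUND) >= DATETIME('2023-01-01')
);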

Performance impact:

  • Partitioning by discovery date → 10x faster time-range queries
  • Clustering by risk/status → 5x faster filtered queries
  • Date filtering → Reduced data scanned by 60%

Business Value Delivered

Operational Excellence

  • 360° vulnerability visibility across all IT assets
  • Real-time risk assessment with automated SLA tracking
  • Centralized security metrics for executive reporting
  • Compliance-ready data structure for audit requirements

Advanced Analytics

  • Trend analysis of vulnerability discovery and remediation
  • Risk correlation between asset types and vulnerability patterns
  • SLA performance tracking with automated alerting
  • Predictive insights for proactive security management

Enterprise Features

  • Scalable architecture supporting millions of vulnerability records
  • Multi-language support for international teams
  • Comprehensive audit trail for regulatory compliance
  • Optimized performance with BigQuery-native features

Key Takeaways

  1. Data engineering enables self-service — The security team could already track vulnerabilities; my job was making that data accessible for analysis
  2. Medallion architecture works — Bronze/silver/gold layers provide clear data contracts and transformation stages
  3. AI accelerates implementation — The repetitive parts (schemas, transformations) go faster; the thinking parts still need human judgment
  4. Internal projects matter — This wasn’t a flashy external project, but it delivered real value to our security operations

Technologies Used

  • BigQuery — Cloud data warehouse
  • SQL — Transformations and optimization
  • Medallion Architecture — Data organization pattern
  • AI Collaboration — Claude for code generation and documentation

This post demonstrates data engineering patterns from a real internal project. Technical details have been generalized for portfolio purposes while maintaining demonstration value.