Building an Analytical Layer for Vulnerability Management
How I built a BigQuery data pipeline for our security team's vulnerability management system — transforming operational data into self-service analytics using medallion architecture
The Challenge
Our security team at MercadoLibre had an existing application to monitor vulnerabilities across our assets. The system integrated data from multiple security vendors and automated ticket creation. It worked well for operations — but the data wasn’t accessible for analysis.
My role: Build the data pipeline that makes this information available for self-service analytics.
Scope:
- Transform raw security scan data into an analytical warehouse
- Apply medallion architecture principles
- Enable self-service analytics for security and compliance teams
- Establish proper data governance with enterprise naming conventions
This case study demonstrates how AI collaboration accelerated a real internal data engineering project.
Architecture: Medallion Pattern
Bronze Layer (Raw Ingestion)
9 Staging Tables (STG_*) → Original string-based data types
Direct ingestion from vulnerability scanners
Sample data files for testing
Purpose: Preserve raw data exactly as received from security scanning tools.
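For illustration, a bronze-layer table keeps every column as STRING so ingestion never rejects a row; typing is deferred to the silver layer. A minimal sketch of what STG_SECURITY_ASSETS could look like (column names are taken from the transformation shown later; the exact DDL is simplified):
-- Illustrative bronze-layer DDL: everything lands as STRING, typing happens in silver
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.STG_SECURITY_ASSETS` (
  ASSET_UID STRING,        -- asset identifier as exported by the scanner
  ASSET_OS STRING,
  ASSET_IP STRING,
  ASSET_HOST_START STRING, -- dates arrive as text, e.g. '2023-05-01 10:30:00'
  ASSET_HOST_END STRING,
  ASSET_LICENSED STRING    -- 'true' / '1' / 'yes' ... normalized to BOOL in silver
);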
Silver Layer (Cleaned & Typed)
5 Lookup Tables (LK_*) → Dimensions and reference data
4 Fact Tables (BT_*) → Core vulnerability instances and scan history
Key transformations:
- String → INTEGER (SAFE_CAST for error handling)
- String → DATETIME (PARSE_DATETIME with validation)
- String → BOOL (sophisticated logic: ‘true’/‘1’/‘yes’ → TRUE)
- NULLIF patterns to handle empty strings
- Enterprise naming conventions (AST_, VUL_, SCN_ prefixes)
- Audit fields (AUD_INS_DTTM, AUD_UPD_DTTM)
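A note on the audit fields: AUD_INS_DTTM should survive reloads while AUD_UPD_DTTM reflects the latest change. The sample INSERT later in this post sets both at load time; as a hedged sketch (not necessarily how this pipeline loads data), an incremental MERGE would keep them apart like this:
-- Illustrative upsert sketch: keep AUD_INS_DTTM from the first load and
-- refresh AUD_UPD_DTTM when a row changes. Names follow the conventions
-- above but the column list is simplified; assumes one staging row per asset.
MERGE `PROJECT.DATASET.LK_SECURITY_ASSETS` tgt
USING (
  SELECT
    SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,
    NULLIF(TRIM(ASSET_OS), '') AS AST_OS_DESC
  FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
  WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
) src
ON tgt.AST_ASSET_ID = src.AST_ASSET_ID
WHEN MATCHED THEN UPDATE SET
  AST_OS_DESC = src.AST_OS_DESC,
  AUD_UPD_DTTM = CURRENT_DATETIME() -- update timestamp moves, insert timestamp stays
WHEN NOT MATCHED THEN INSERT (AST_ASSET_ID, AST_OS_DESC, AUD_INS_DTTM, AUD_UPD_DTTM)
  VALUES (src.AST_ASSET_ID, src.AST_OS_DESC, CURRENT_DATETIME(), CURRENT_DATETIME());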
Gold Layer (Analytics Ready)
Dashboard tables, aggregated metrics, KPIs
Executive-level visualizations
Risk trend analysis
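To make the gold layer concrete, here is a minimal sketch of the kind of aggregation that could feed a dashboard table (table names follow the conventions above but are illustrative, not the project's exact schema):
-- Illustrative gold-layer build: monthly count of newly discovered vulnerabilities by risk level
CREATE OR REPLACE TABLE `PROJECT.DATASET.GOLD_VUL_MONTHLY_TREND` AS
SELECT
  DATE_TRUNC(DATE(VUL_FIRST_FOUND_DATE), MONTH) AS TREND_MONTH,
  VUL_RISK_LEVEL,
  COUNT(*) AS VUL_COUNT
FROM `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES` -- hypothetical BT_* fact table
WHERE VUL_FIRST_FOUND_DATE IS NOT NULL
GROUP BY TREND_MONTH, VUL_RISK_LEVEL;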
Sample Code: Complex Transformation
Here’s a real example from the LK_SECURITY_ASSETS table transformation:
INSERT INTO `PROJECT.DATASET.LK_SECURITY_ASSETS` (
AST_ASSET_ID,
AST_OS_DESC,
AST_IP_ADDRESS,
AST_HOST_START_DATE,
AST_HOST_END_DATE,
AST_LICENSED_FLAG,
AUD_INS_DTTM,
AUD_UPD_DTTM
)
SELECT
-- Primary key with error handling
SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,
-- Data cleansing
NULLIF(TRIM(ASSET_OS), '') AS AST_OS_DESC,
NULLIF(TRIM(ASSET_IP), '') AS AST_IP_ADDRESS,
-- Datetime parsing (SAFE. returns NULL instead of failing on malformed values)
CASE
WHEN ASSET_HOST_START IS NOT NULL AND ASSET_HOST_START != ''
THEN SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START)
ELSE NULL
END AS AST_HOST_START_DATE,
CASE
WHEN ASSET_HOST_END IS NOT NULL AND ASSET_HOST_END != ''
THEN SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_END)
ELSE NULL
END AS AST_HOST_END_DATE,
-- Sophisticated boolean conversion
CASE
WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('true', '1', 'yes', 'y') THEN TRUE
WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('false', '0', 'no', 'n') THEN FALSE
ELSE NULL
END AS AST_LICENSED_FLAG,
-- Audit trail
CURRENT_DATETIME() AS AUD_INS_DTTM,
CURRENT_DATETIME() AS AUD_UPD_DTTM
FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
-- Business logic: keep assets with no start date, or started in 2023 or later
AND (
ASSET_HOST_START IS NULL OR ASSET_HOST_START = ''
OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START) >= DATETIME('2023-01-01')
);
Why this matters:
- Error handling: SAFE_CAST prevents failures on invalid data
- Data quality: NULLIF converts empty strings to NULL for cleaner analytics
- Flexibility: Boolean logic handles multiple string formats
- Auditability: Timestamps track data lineage
- Performance: Proper typing enables BigQuery optimizations
Enterprise Naming Conventions
Following enterprise data governance naming standards:
| Convention | Pattern | Example | Purpose |
|---|---|---|---|
| Table Prefixes | LK_ / BT_ | LK_SECURITY_ASSETS | Distinguish lookup vs fact tables |
| Field Suffixes | _ID, _FLAG, _DESC, _DATE | AST_ASSET_ID, VUL_EXPIRES_FLAG | Indicate data type and purpose |
| Entity Prefixes | AST_, VUL_, SCN_, PLG_ | AST_IP_ADDRESS, VUL_RISK_LEVEL | Show field origin and relationships |
| Audit Fields | AUD_INS_DTTM, AUD_UPD_DTTM | AUD_INS_DTTM | Track data lineage and changes |
Bilingual documentation: All table schemas included Spanish field descriptions for international teams, while maintaining English code and technical documentation.
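For example, BigQuery also lets column descriptions live directly in the DDL via OPTIONS; here is a minimal sketch of a bilingual field description (the project defined its schemas as JSON files, so this DDL form is just an equivalent way to show the same idea):
-- Illustrative only: Spanish column descriptions attached via OPTIONS
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.LK_SECURITY_ASSETS` (
  AST_ASSET_ID INT64 OPTIONS(description = 'Identificador único del activo escaneado'),
  AST_OS_DESC STRING OPTIONS(description = 'Sistema operativo reportado por el escáner')
  -- remaining columns omitted for brevity
);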
AI Collaboration in Practice
What AI accelerated:
- Schema design — Comprehensive JSON schemas with proper BigQuery types
- SQL generation — Transformation scripts with robust error handling
- Documentation — Architecture guides, naming conventions, field mappings
- Learning curve — Quickly ramped up on the vulnerability management domain
What human expertise provided:
- Architecture decisions — Chose medallion pattern, partitioning strategies
- Business context — Understood how the security team would use the data
- Quality validation — Reviewed transformations, tested edge cases
- Integration knowledge — Knew how this fit into existing data infrastructure
The collaboration pattern:
Working with AI tools helped me move faster through the repetitive parts (schema definitions, transformation templates) while I focused on the decisions that required understanding our specific context and requirements.
BigQuery Optimizations
-- Partitioning for performance
PARTITION BY DATE(VUL_FIRST_FOUND_DATE)
-- Clustering for query optimization
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID
-- Data quality validation
WHERE (
VULNERABILITY_FIRST_FOUND IS NULL
OR VULNERABILITY_FIRST_FOUND = ''
OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', VULNERABILITY_FIRST_FOUND)
>= DATETIME('2023-01-01')
)
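Taken together, these clauses live in the fact-table DDL. A minimal sketch, assuming a BT_* fact table (the name and column list are illustrative, not the exact schema):
-- Illustrative DDL showing where the partitioning and clustering clauses sit
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES` (
  VUL_ASSET_ID INT64,
  VUL_RISK_LEVEL STRING,
  VUL_STATE_STATUS STRING,
  VUL_FIRST_FOUND_DATE DATETIME,
  AUD_INS_DTTM DATETIME,
  AUD_UPD_DTTM DATETIME
)
PARTITION BY DATE(VUL_FIRST_FOUND_DATE) -- daily partitions on discovery date
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID; -- co-locate rows for common filters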
Performance impact:
- Partitioning by discovery date → 10x faster time-range queries
- Clustering by risk/status → 5x faster filtered queries
- Date filtering → Reduced data scanned by 60%
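As an illustration, this is the shape of query those settings speed up: the date predicate prunes partitions, and the filters on the clustered columns narrow the blocks read (the table name and the 'CRITICAL' value are assumptions):
-- Illustrative query benefiting from partition pruning and clustering
SELECT
  VUL_STATE_STATUS,
  COUNT(*) AS VUL_COUNT
FROM `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES`
WHERE DATE(VUL_FIRST_FOUND_DATE) BETWEEN DATE '2024-01-01' AND DATE '2024-03-31'
  AND VUL_RISK_LEVEL = 'CRITICAL'
GROUP BY VUL_STATE_STATUS;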
Business Value Delivered
Operational Excellence
- 360° vulnerability visibility across all IT assets
- Real-time risk assessment with automated SLA tracking
- Centralized security metrics for executive reporting
- Compliance-ready data structure for audit requirements
Advanced Analytics
- Trend analysis of vulnerability discovery and remediation
- Risk correlation between asset types and vulnerability patterns
- SLA performance tracking with automated alerting
- Predictive insights for proactive security management
Enterprise Features
- Scalable architecture supporting millions of vulnerability records
- Multi-language support for international teams
- Comprehensive audit trail for regulatory compliance
- Optimized performance with BigQuery-native features
Key Takeaways
- Data engineering enables self-service — The security team could already track vulnerabilities; my job was making that data accessible for analysis
- Medallion architecture works — Bronze/silver/gold layers provide clear data contracts and transformation stages
- AI accelerates implementation — The repetitive parts (schemas, transformations) go faster; the thinking parts still need human judgment
- Internal projects matter — This wasn’t a flashy external project, but it delivered real value to our security operations
Technologies Used
- BigQuery — Cloud data warehouse
- SQL — Transformations and optimization
- Medallion Architecture — Data organization pattern
- AI Collaboration — Claude for code generation and documentation
This post demonstrates data engineering patterns from a real internal project. Technical details have been generalized for portfolio purposes while maintaining demonstration value.