Building an Analytical Layer for Vulnerability Management
How I built a BigQuery data pipeline for our security team's vulnerability management system — transforming operational data into self-service analytics using medallion architecture
The Challenge
Our security team at MercadoLibre had an existing application to monitor vulnerabilities across our assets. The system integrated data from multiple security vendors and automated ticket creation. It worked well for operations — but the data wasn’t accessible for analysis.
My role: Build the data pipeline that makes this information available for self-service analytics.
Scope:
- Transform raw security scan data into an analytical warehouse
- Apply medallion architecture principles
- Enable self-service analytics for security and compliance teams
- Establish proper data governance with enterprise naming conventions
This case study demonstrates how AI collaboration accelerated a real internal data engineering project.
Architecture: Medallion Pattern
Bronze Layer (Raw Ingestion)
9 Staging Tables (STG_*) → Original string-based data types
Direct ingestion from vulnerability scanners
Sample data files for testing
Purpose: Preserve raw data exactly as received from security scanning tools.
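For illustration, a bronze-layer table keeps every column as STRING so ingestion never rejects a row; typing is deferred to the silver layer. A minimal sketch of what STG_SECURITY_ASSETS could look like (column names are taken from the transformation shown later; the exact DDL is simplified):
-- Illustrative bronze-layer DDL: everything lands as STRING, typing happens in silver
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.STG_SECURITY_ASSETS` (
  ASSET_UID STRING,        -- asset identifier as exported by the scanner
  ASSET_OS STRING,
  ASSET_IP STRING,
  ASSET_HOST_START STRING, -- dates arrive as text, e.g. '2023-05-01 10:30:00'
  ASSET_HOST_END STRING,
  ASSET_LICENSED STRING    -- 'true' / '1' / 'yes' ... normalized to BOOL in silver
);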
Silver Layer (Cleaned & Typed)
5 Lookup Tables (LK_*) → Dimensions and reference data
4 Fact Tables (BT_*) → Core vulnerability instances and scan history
Key transformations:
- String → INTEGER (SAFE_CAST for error handling)
- String → DATETIME (PARSE_DATETIME with validation)
- String → BOOL (sophisticated logic: ‘true’/‘1’/‘yes’ → TRUE)
- NULLIF patterns to handle empty strings
- Enterprise naming conventions (AST_, VUL_, SCN_ prefixes)
- Audit fields (AUD_INS_DTTM, AUD_UPD_DTTM)
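A note on the audit fields: AUD_INS_DTTM should survive reloads while AUD_UPD_DTTM reflects the latest change. The sample INSERT later in this post sets both at load time; as a hedged sketch (not necessarily how this pipeline loads data), an incremental MERGE would keep them apart like this:
-- Illustrative upsert sketch: keep AUD_INS_DTTM from the first load and
-- refresh AUD_UPD_DTTM when a row changes. Names follow the conventions
-- above but the column list is simplified; assumes one staging row per asset.
MERGE `PROJECT.DATASET.LK_SECURITY_ASSETS` tgt
USING (
  SELECT
    SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,
    NULLIF(TRIM(ASSET_OS), '') AS AST_OS_DESC
  FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
  WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
) src
ON tgt.AST_ASSET_ID = src.AST_ASSET_ID
WHEN MATCHED THEN UPDATE SET
  AST_OS_DESC = src.AST_OS_DESC,
  AUD_UPD_DTTM = CURRENT_DATETIME() -- update timestamp moves, insert timestamp stays
WHEN NOT MATCHED THEN INSERT (AST_ASSET_ID, AST_OS_DESC, AUD_INS_DTTM, AUD_UPD_DTTM)
  VALUES (src.AST_ASSET_ID, src.AST_OS_DESC, CURRENT_DATETIME(), CURRENT_DATETIME());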
Gold Layer (Analytics Ready)
Dashboard tables, aggregated metrics, KPIs
Executive-level visualizations
Risk trend analysis
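To make the gold layer concrete, here is a minimal sketch of the kind of aggregation that could feed a dashboard table (table names follow the conventions above but are illustrative, not the project's exact schema):
-- Illustrative gold-layer build: monthly count of newly discovered vulnerabilities by risk level
CREATE OR REPLACE TABLE `PROJECT.DATASET.GOLD_VUL_MONTHLY_TREND` AS
SELECT
  DATE_TRUNC(DATE(VUL_FIRST_FOUND_DATE), MONTH) AS TREND_MONTH,
  VUL_RISK_LEVEL,
  COUNT(*) AS VUL_COUNT
FROM `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES` -- hypothetical BT_* fact table
WHERE VUL_FIRST_FOUND_DATE IS NOT NULL
GROUP BY TREND_MONTH, VUL_RISK_LEVEL;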
Sample Code: Complex Transformation
Here’s a real example from the LK_SECURITY_ASSETS table transformation:
INSERT INTO `PROJECT.DATASET.LK_SECURITY_ASSETS` (
AST_ASSET_ID,
AST_OS_DESC,
AST_IP_ADDRESS,
AST_HOST_START_DATE,
AST_HOST_END_DATE,
AST_LICENSED_FLAG,
AUD_INS_DTTM,
AUD_UPD_DTTM
)
SELECT
-- Primary key with error handling
SAFE_CAST(ASSET_UID AS INT64) AS AST_ASSET_ID,
-- Data cleansing
NULLIF(TRIM(ASSET_OS), '') AS AST_OS_DESC,
NULLIF(TRIM(ASSET_IP), '') AS AST_IP_ADDRESS,
-- Datetime parsing (SAFE. returns NULL instead of failing on malformed values)
CASE
WHEN ASSET_HOST_START IS NOT NULL AND ASSET_HOST_START != ''
THEN SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START)
ELSE NULL
END AS AST_HOST_START_DATE,
CASE
WHEN ASSET_HOST_END IS NOT NULL AND ASSET_HOST_END != ''
THEN SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_END)
ELSE NULL
END AS AST_HOST_END_DATE,
-- Sophisticated boolean conversion
CASE
WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('true', '1', 'yes', 'y') THEN TRUE
WHEN LOWER(TRIM(ASSET_LICENSED)) IN ('false', '0', 'no', 'n') THEN FALSE
ELSE NULL
END AS AST_LICENSED_FLAG,
-- Audit trail
CURRENT_DATETIME() AS AUD_INS_DTTM,
CURRENT_DATETIME() AS AUD_UPD_DTTM
FROM `PROJECT.DATASET.STG_SECURITY_ASSETS`
WHERE ASSET_UID IS NOT NULL AND ASSET_UID != ''
-- Business logic: keep assets with no start date, or started in 2023 or later
AND (
ASSET_HOST_START IS NULL OR ASSET_HOST_START = ''
OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', ASSET_HOST_START) >= DATETIME('2023-01-01')
);
Why this matters:
- Error handling: SAFE_CAST prevents failures on invalid data
- Data quality: NULLIF converts empty strings to NULL for cleaner analytics
- Flexibility: Boolean logic handles multiple string formats
- Auditability: Timestamps track data lineage
- Performance: Proper typing enables BigQuery optimizations
Enterprise Naming Conventions
Following enterprise data governance naming standards:
| Convention | Pattern | Example | Purpose |
|---|---|---|---|
| Table Prefixes | LK_ / BT_ | LK_SECURITY_ASSETS | Distinguish lookup vs fact tables |
| Field Suffixes | _ID, _FLAG, _DESC, _DATE | AST_ASSET_ID, VUL_EXPIRES_FLAG | Indicate data type and purpose |
| Entity Prefixes | AST_, VUL_, SCN_, PLG_ | AST_IP_ADDRESS, VUL_RISK_LEVEL | Show field origin and relationships |
| Audit Fields | AUD_INS_DTTM, AUD_UPD_DTTM | AUD_INS_DTTM | Track data lineage and changes |
Bilingual documentation: All table schemas included Spanish field descriptions for international teams, while maintaining English code and technical documentation.
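For example, BigQuery also lets column descriptions live directly in the DDL via OPTIONS; here is a minimal sketch of a bilingual field description (the project defined its schemas as JSON files, so this DDL form is just an equivalent way to show the same idea):
-- Illustrative only: Spanish column descriptions attached via OPTIONS
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.LK_SECURITY_ASSETS` (
  AST_ASSET_ID INT64 OPTIONS(description = 'Identificador único del activo escaneado'),
  AST_OS_DESC STRING OPTIONS(description = 'Sistema operativo reportado por el escáner')
  -- remaining columns omitted for brevity
);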
AI Collaboration in Practice
What AI accelerated:
- Schema design — Comprehensive JSON schemas with proper BigQuery types
- SQL generation — Transformation scripts with robust error handling
- Documentation — Architecture guides, naming conventions, field mappings
- Learning curve — Quickly ramped up on the vulnerability management domain
What human expertise provided:
- Architecture decisions — Chose medallion pattern, partitioning strategies
- Business context — Understood how the security team would use the data
- Quality validation — Reviewed transformations, tested edge cases
- Integration knowledge — Knew how this fit into existing data infrastructure
The collaboration pattern:
Working with AI tools helped me move faster through the repetitive parts (schema definitions, transformation templates) while I focused on the decisions that required understanding our specific context and requirements.
BigQuery Optimizations
-- Partitioning for performance
PARTITION BY DATE(VUL_FIRST_FOUND_DATE)
-- Clustering for query optimization
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID
-- Data quality validation
WHERE (
VULNERABILITY_FIRST_FOUND IS NULL
OR VULNERABILITY_FIRST_FOUND = ''
OR SAFE.PARSE_DATETIME('%Y-%m-%d %H:%M:%S', VULNERABILITY_FIRST_FOUND)
>= DATETIME('2023-01-01')
)
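Taken together, these clauses live in the fact-table DDL. A minimal sketch, assuming a BT_* fact table (the name and column list are illustrative, not the exact schema):
-- Illustrative DDL showing where the partitioning and clustering clauses sit
CREATE TABLE IF NOT EXISTS `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES` (
  VUL_ASSET_ID INT64,
  VUL_RISK_LEVEL STRING,
  VUL_STATE_STATUS STRING,
  VUL_FIRST_FOUND_DATE DATETIME,
  AUD_INS_DTTM DATETIME,
  AUD_UPD_DTTM DATETIME
)
PARTITION BY DATE(VUL_FIRST_FOUND_DATE) -- daily partitions on discovery date
CLUSTER BY VUL_RISK_LEVEL, VUL_STATE_STATUS, VUL_ASSET_ID; -- co-locate rows for common filters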
Performance impact:
- Partitioning by discovery date → 10x faster time-range queries
- Clustering by risk/status → 5x faster filtered queries
- Date filtering → Reduced data scanned by 60%
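As an illustration, this is the shape of query those settings speed up: the date predicate prunes partitions, and the filters on the clustered columns narrow the blocks read (the table name and the 'CRITICAL' value are assumptions):
-- Illustrative query benefiting from partition pruning and clustering
SELECT
  VUL_STATE_STATUS,
  COUNT(*) AS VUL_COUNT
FROM `PROJECT.DATASET.BT_VULNERABILITY_INSTANCES`
WHERE DATE(VUL_FIRST_FOUND_DATE) BETWEEN DATE '2024-01-01' AND DATE '2024-03-31'
  AND VUL_RISK_LEVEL = 'CRITICAL'
GROUP BY VUL_STATE_STATUS;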
Business Value Delivered
Operational Excellence
- 360° vulnerability visibility across all IT assets
- Real-time risk assessment with automated SLA tracking
- Centralized security metrics for executive reporting
- Compliance-ready data structure for audit requirements
Advanced Analytics
- Trend analysis of vulnerability discovery and remediation
- Risk correlation between asset types and vulnerability patterns
- SLA performance tracking with automated alerting
- Predictive insights for proactive security management
Enterprise Features
- Scalable architecture supporting millions of vulnerability records
- Multi-language support for international teams
- Comprehensive audit trail for regulatory compliance
- Optimized performance with BigQuery-native features
Key Takeaways
- Data engineering enables self-service — The security team could already track vulnerabilities; my job was making that data accessible for analysis
- Medallion architecture works — Bronze/silver/gold layers provide clear data contracts and transformation stages
- AI accelerates implementation — The repetitive parts (schemas, transformations) go faster; the thinking parts still need human judgment
- Internal projects matter — This wasn’t a flashy external project, but it delivered real value to our security operations
Technologies Used
- BigQuery — Cloud data warehouse
- SQL — Transformations and optimization
- Medallion Architecture — Data organization pattern
- AI Collaboration — Claude for code generation and documentation
This post demonstrates data engineering patterns from a real internal project. Technical details have been generalized for portfolio purposes while maintaining demonstration value.