2025 — My Year in Data

A retrospective on building a full Data Mesh environment at MercadoLibre's Internal Systems, migrating 6+ pipelines to production, and planting the seeds of a data-driven culture initiative.

Data Engineering, Data Mesh, MercadoLibre, Year in Review, BigQuery, DataFlow

Where It Started

At the beginning of 2025, I inherited something both exciting and humbling: a Data Mesh Environment (DME) with potential but almost no data products. The DME — a governed BigQuery environment under crs-csec-is-gov — existed on paper. It had the right architecture, the right governance model, the right tooling. What it didn’t have was data.

203 production tables, 162 published views, and 93 active scheduled jobs by year’s end. None of those existed in January.

This is a post about what it took to get there, what I learned along the way, and what I would do differently.


The Infrastructure Year

If I had to describe 2025 in a single word, it would be foundation. The year was almost entirely about building the plumbing before anyone could turn on the faucet.

The core work was migrating six data domains from the sandbox — a collection of unmonitored stored procedures and undocumented BigQuery tables — to the Data Mesh. Each migration followed the same architecture: Bronze (raw extraction) → Silver (typed, validated, business-enriched) → Gold (aggregated data marts). Every layer uses a different BigQuery dataset, enforces audit fields, and runs on scheduled DataFlow jobs with automated alerting.
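To make the layering concrete, here is a minimal Python sketch of the three steps. It is not the production DataFlow code: the record fields, the audit column names (`aud_source`, `aud_ins_dttm`), and the helper functions are all hypothetical, chosen only to show that each layer adds exactly one concern.

```python
from collections import Counter
from datetime import datetime, timezone


def with_audit(row: dict, source: str) -> dict:
    """Attach hypothetical audit fields to a row."""
    return {**row, "aud_source": source,
            "aud_ins_dttm": datetime.now(timezone.utc).isoformat()}


def to_bronze(raw_rows: list[dict]) -> list[dict]:
    """Bronze: land the rows as extracted, adding only audit fields."""
    return [with_audit(r, "source_system") for r in raw_rows]


def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Silver: cast types, normalize values, drop rows that fail validation."""
    silver = []
    for r in bronze_rows:
        try:
            asset_id = int(r["asset_id"])
        except (KeyError, TypeError, ValueError):
            continue  # invalid rows never reach Silver
        status = str(r.get("status", "")).strip().upper()
        silver.append(with_audit({"asset_id": asset_id, "status": status}, "bronze"))
    return silver


def to_gold(silver_rows: list[dict]) -> dict:
    """Gold: aggregate Silver rows into a small mart (row count per status)."""
    return dict(Counter(r["status"] for r in silver_rows))


raw = [
    {"asset_id": "101", "status": " deployed "},
    {"asset_id": "102", "status": "Deployed"},
    {"asset_id": "abc", "status": "stock"},  # fails type validation
    {"asset_id": "103", "status": "stock"},
]
gold = to_gold(to_silver(to_bronze(raw)))
print(gold)  # {'DEPLOYED': 2, 'STOCK': 1}
```

Because each function only reads the output of the layer before it, any one of them can be rerun without touching the others.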

Here’s what shipped:

SNIPE-IT — The Canonical Migration

The first major migration was SNIPE-IT, our IT asset management system. Before the migration, the entire 192K+ device inventory lived in a stored procedure called SP_ASSETS_UNSNIPEIT — a 40-year-old pattern running on MySQL with no versioning, no monitoring, and a single point of failure: me.

After the migration: 29 tables (10 Bronze, 14 Silver, 5 Gold), three DataFlow jobs running twice a day, and a DM_SNP_WIDETABLE with 243+ columns that any team can query without asking me for a JOIN. The stored procedure is retired.

That migration became the template. The lesson: the hardest part of a migration is not the SQL — it’s building the case that the new system is better than the old one. For SNIPE-IT, that case was clear (uptime, discoverability, self-service). For other domains, it required more patience.

WATSON — Vulnerability Management at Scale

Watson is our vulnerability management system. It tracks CVEs, open findings, remediation timelines, and risk scores across the organization. Before the migration, the data was trapped in proprietary system exports, available only to the security team.

After: 28 tables, daily refresh, and a DM_WATSON_CENTRAL_METRICS that Looker Studio dashboards consume directly. The Gold mart pre-calculates risk scores, aging buckets, and office-level compliance rates — things that used to require a custom SQL request to the security team.
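As an illustration of what "pre-calculated aging buckets" can mean, here is a small Python sketch. The bucket edges (30, 60, 90 days) and the function name are assumptions made for the example; the actual mart may define its buckets differently.

```python
from datetime import date

# Hypothetical bucket edges in days; the real mart may use different ones.
BUCKETS = [(30, "0-30"), (60, "31-60"), (90, "61-90")]


def aging_bucket(opened: date, as_of: date) -> str:
    """Return the aging bucket label for a finding opened on `opened`."""
    age_days = (as_of - opened).days
    if age_days < 0:
        raise ValueError("finding opened after the as-of date")
    for upper, label in BUCKETS:
        if age_days <= upper:
            return label
    return "90+"


as_of = date(2025, 12, 31)
print(aging_bucket(date(2025, 12, 1), as_of))  # 0-30
print(aging_bucket(date(2025, 11, 1), as_of))  # 31-60
print(aging_bucket(date(2025, 1, 1), as_of))   # 90+
```

Computing the label once in Gold means every dashboard shares the same bucket definition instead of re-deriving it per query.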

SINERGYS — Physical Access at 5x Daily

Genetec Synergis is our physical access control system — the data behind every badge swipe at MercadoLibre facilities across the region. The volume isn’t massive (~500K cardholder records), but the freshness requirement is: operational teams need near-real-time access data.

The SINERGYS pipeline runs five times a day. 22 tables. The Gold mart DM_SYN_CARDHOLDER gives IS operations a 360° view of who has access to what, without querying the Genetec SQL Server directly.

GMT, BAILEY, ARGOS — The Long Tail

The remaining three migrations (ticket management, employee lifecycle, and compliance licensing) followed the same pattern but presented different challenges.

GMT (our ticketing system) had 30 source tables and 7 legacy stored procedures that other systems depended on. The Bronze layer migrated cleanly; the Silver and Gold layers required rewriting years of accumulated business logic.

BAILEY (employee onboarding/offboarding) involved PII handling from day one. We spent one full sprint designing the data model to ensure sensitive fields were either excluded or masked before landing in Silver. The PII elimination framework built for BAILEY became the reference for all future pipelines.
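The "excluded or masked" rule can be sketched in a few lines of Python. This is an illustration only, not the framework built for BAILEY: the field names, the two policy sets, and the choice of a salted SHA-256 digest as the masking method are all assumptions.

```python
import hashlib

# Hypothetical policy: fields in neither set pass through unchanged.
EXCLUDE = {"national_id", "home_address"}
MASK = {"email", "phone"}


def mask_value(value: str, salt: str) -> str:
    """Replace a value with a salted SHA-256 digest, truncated for readability."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def scrub(row: dict, salt: str) -> dict:
    """Drop excluded fields and mask sensitive ones before the row lands in Silver."""
    clean = {}
    for key, value in row.items():
        if key in EXCLUDE:
            continue
        if key in MASK and value is not None:
            clean[key] = mask_value(str(value), salt)
        else:
            clean[key] = value
    return clean


row = {"employee_id": 42, "email": "ana@example.com",
       "national_id": "12345678", "start_date": "2025-03-01"}
print(scrub(row, salt="demo-salt"))
```

A deterministic digest keeps masked values joinable across tables, at the cost of being pseudonymization rather than full anonymization; which trade-off is acceptable depends on the field.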

ARGOS (compliance licensing for CrowdStrike, Microsoft 365, Google Workspace) was the newest domain. The data arrives via REST API — not a database — so Bronze required a custom Python extractor. By year’s end, the full orchestrator runs five times a day and feeds DM_ARG_COMPLIANCE with licensing compliance data for the entire organization.
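For readers who have not written an API-backed Bronze step, here is a minimal sketch of the idea in Python. The response shape (`items`, `next_page`), the function names, and the `licensing_api` label are hypothetical; the real extractor and the vendor APIs will differ. The HTTP call is replaced by a stand-in so the example runs without network access.

```python
import json
from typing import Callable, Iterable, Iterator, Optional


def extract_all(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item from a paginated API, following `next_page` until it is None."""
    page: Optional[int] = 1
    while page is not None:
        payload = fetch_page(page)
        yield from payload.get("items", [])
        page = payload.get("next_page")


def to_bronze_rows(items: Iterable[dict], source: str) -> list[dict]:
    """Keep each item as an untransformed JSON string, one raw Bronze row per item."""
    return [{"source": source, "payload": json.dumps(item, sort_keys=True)}
            for item in items]


# Stand-in for a real HTTP call (assumed response shape).
FAKE_PAGES = {
    1: {"items": [{"device": "a"}, {"device": "b"}], "next_page": 2},
    2: {"items": [{"device": "c"}], "next_page": None},
}


def fake_fetch(page: int) -> dict:
    return FAKE_PAGES[page]


rows = to_bronze_rows(extract_all(fake_fetch), source="licensing_api")
print(len(rows))  # 3
```

Passing the fetch function in as a parameter keeps pagination logic testable on its own, separate from authentication and retry handling.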


Technical Highlight: CHECKSECDB

One pipeline deserves its own section: THUNDERA, built internally on the CHECKSECDB database.

CHECKSECDB is MercadoLibre’s internal compliance tracking system — it stores task checklists, module scores, asset inventories, and finding records for every operational center across Latin America. The data model is complex (21 Bronze extractions, 16 Silver tables), and the output is four Gold marts that power the main compliance dashboard used by IS leadership.

The reason this pipeline stands out is the architecture decision: rather than building one massive reporting stored procedure, we separated concerns across layers. Bronze is dumb (raw extraction, no transformations). Silver is typed and business-enriched. Gold is aggregated for consumption. Each layer can be rebuilt independently. When a source schema changes, only Bronze needs updating.

This is textbook medallion architecture. It still took three months to get right.


The Governance Layer

Building pipelines is the visible part of data engineering. The invisible part is what makes them trustworthy.

Midway through the year, a governance audit revealed that 145 of our production tables had BAD_QUALITY=1 in the official Artifacts Health dashboard — meaning their descriptions were either missing or too generic to be useful. A table called BT_ARG_WSONE with the description “WS1 devices” is technically documented. It is not actually useful.

We started a systematic enrichment project: generate business-context descriptions for every column across seven domains, apply them through the DataFlow platform, and track the quality score monthly. It’s unglamorous work. It’s also the difference between a data product and a data warehouse.
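A check like the one below can help triage which descriptions to rewrite first. It is a rough heuristic of my own for illustration, not the rule the Artifacts Health dashboard uses to set BAD_QUALITY: the word threshold and the "restates the table name" test are assumptions.

```python
import re

MIN_WORDS = 6  # hypothetical threshold, not the official rule


def is_low_quality(table_name: str, description) -> bool:
    """Flag descriptions that are missing, very short, or only restate the table name."""
    if not description or not description.strip():
        return True
    words = re.findall(r"[a-z0-9]+", description.lower())
    if len(words) < MIN_WORDS:
        return True
    name_tokens = set(re.findall(r"[a-z0-9]+", table_name.lower()))
    informative = [w for w in words if w not in name_tokens]
    return len(informative) < MIN_WORDS


print(is_low_quality("BT_ARG_WSONE", "WS1 devices"))  # True
print(is_low_quality(
    "ASSET_INVENTORY",  # hypothetical table
    "One row per hardware asset with its current status, assignee and office location",
))  # False
```

A heuristic like this only finds the obviously weak descriptions; whether a description carries real business context still needs a human read.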

MercadoLibre’s governance framework measures three things: Discoverability (can someone find and understand your data?), Efficiency (does it cost what it should to maintain?), and Integrity (is it accurate and reliable?). By year’s end, we had a baseline for each. The targets for 2026 are set.


Seeds Planted for 2026

Two initiatives started in late 2025 that will define 2026:

The DDI Initiative — IS Data Boost is a program to improve the Data Driven Index for Internal Systems from 12.6% to 30% Intermediate+ by December 2026. DDI is MercadoLibre’s measure of data tool adoption per employee. The first leader mapping was done in December, and the first hands-on sessions are scheduled for Q1 2026. More on this in a separate post.

The is-datalake-toolkit — A set of Claude skills I’m building to help the team validate jobs before promoting to production, scaffold new pipelines according to architecture standards, and troubleshoot failing jobs automatically. The goal is to embed architectural governance into the development workflow, not just the review process.


What I’d Do Differently

Communicate value earlier. I spent most of the year building pipelines and very little time explaining what those pipelines enabled. By the time I had 200+ tables in production, the value was obvious to me and invisible to almost everyone else. Architecture without narrative is infrastructure without users.

Define “done” before starting. Each pipeline took longer than expected not because of technical complexity, but because the definition of done kept expanding. A Bronze extraction became “also do the Silver.” A Silver migration became “also update the Gold mart.” Scope creep in data engineering looks exactly like good intentions.

Ship smaller. The temptation with a greenfield environment is to design the perfect system from the start. The better move is to get one domain to production, learn from it, then apply those lessons to the next one. We did this eventually. I wish we had done it from day one.


Closing

2025 was the year the DME went from potential to production. The numbers are real — 203 tables, 93 scheduled jobs, six domains migrated, and a governance framework in place. But numbers don’t capture what it actually means: the first time a security analyst queried DM_WATSON_CENTRAL_METRICS and got their vulnerability risk score without filing a request, or the IS operations team checking DM_SYN_CARDHOLDER during a physical access audit without involving the Genetec team.

Data infrastructure is most valuable when it’s invisible. When a team can answer their own questions without knowing the system’s name. When the pipeline runs on schedule and nobody notices because there was nothing to notice.

That’s what we built in 2025. Next year, we make it count.