What This Session Is About
Running AI experiments inside a Fortune 100 financial institution is a different problem than running them at a startup. The data is more sensitive, the compliance requirements are stricter, and the stakeholder landscape is more complex. But the window for competitive differentiation is real — and the teams that figure out how to move fast inside those constraints will define the next decade.
Asaf shared the approach Northwestern Mutual's GenAI strategy team uses: small, bounded experiments with clear success criteria, aggressive build-vs-buy analysis, and a willingness to switch tools when the data says something isn't working. That included switching from OpenAI to Gemini when the context window made the difference between a broken prototype and a working product.
The Small Bets Framework
Every initiative starts with a time-boxed experiment (2–4 weeks) with a specific, measurable hypothesis. "AI will improve X metric by Y%" — not "AI will transform our BI capabilities." Small scope means fast learning and easy kill decisions when something isn't working.
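One way to make the "improve X metric by Y%" framing concrete is to write the hypothesis down as data and let it drive the continue/kill decision. The sketch below is illustrative only: the class name, the metric name in the usage note, and the three decision labels are assumptions, not something described in the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable claim of the form 'AI will improve X metric by Y%'."""
    metric: str                # hypothetical metric name
    baseline: float            # value measured before the experiment
    target_uplift_pct: float   # the Y in "improve X by Y%"
    max_weeks: int = 4         # time box; the session cites 2-4 weeks

    def uplift_pct(self, observed: float) -> float:
        """Relative improvement over the baseline, in percent."""
        return (observed - self.baseline) / self.baseline * 100.0

    def decide(self, observed: float, weeks_elapsed: int) -> str:
        """Return 'scale', 'kill', or 'continue' (labels are assumptions)."""
        if self.uplift_pct(observed) >= self.target_uplift_pct:
            return "scale"
        if weeks_elapsed >= self.max_weeks:
            return "kill"
        return "continue"
```

For example, `Hypothesis("analyst_queries_per_day", baseline=40.0, target_uplift_pct=25.0)` would be scaled if the metric reaches 52 within the time box, and killed if it is still at 44 when week 4 ends.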
The team's default in 2025 shifted from "build for control" to "buy and customize." Asaf's reasoning: the maturity of commercial AI tooling has increased dramatically. Building a custom tool is only justified when commercial alternatives have a fundamental architectural mismatch with your requirements.
When a text-to-SQL prototype hit accuracy limits, the team's first instinct was to improve the prompt. The actual fix was switching to Gemini 2.5 with its 1M-token context window — enabling the entire database schema to fit in context. Accuracy went from unreliable to ~100%. The model mattered less than the context capacity.
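The idea of fitting the entire schema in context can be sketched as a prompt builder that includes every table and fails loudly when the result would not fit, instead of silently dropping tables. This is a minimal sketch under stated assumptions: the `Table` type, the prompt wording, and the 4-characters-per-token estimate are all hypothetical, and a real system would use the model provider's own tokenizer and client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    columns: dict[str, str]  # column name -> SQL type

    def ddl(self) -> str:
        cols = ",\n".join(f"  {col} {typ}" for col, typ in self.columns.items())
        return f"CREATE TABLE {self.name} (\n{cols}\n);"


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); real tokenizers differ."""
    return len(text) // 4 + 1


def build_prompt(tables: list[Table], question: str, context_limit: int,
                 reserve_for_answer: int = 2_000) -> str:
    """Put every table's DDL in the prompt, or raise if it cannot fit."""
    schema = "\n\n".join(t.ddl() for t in tables)
    prompt = (
        "You translate questions into SQL for the schema below.\n\n"
        f"{schema}\n\n"
        f"Question: {question}\nSQL:"
    )
    needed = estimate_tokens(prompt) + reserve_for_answer
    if needed > context_limit:
        raise ValueError(
            f"prompt needs ~{needed} tokens but the limit is {context_limit}"
        )
    return prompt
```

With a schema of hundreds of tables, the same call that raises `ValueError` under a small `context_limit` goes through unchanged under a 1,000,000-token limit, which is the effect the team described.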
In financial services, AI systems that interact with customer data need explicit guardrails from day one — not retrofitted at scale. Northwestern Mutual's approach: classify data sensitivity tiers, define allowed AI access per tier, and build the enforcement layer into the AI platform infrastructure rather than into individual applications.
Key Insights
- 01 Gemini's large context window was a product-defining capability, not a benchmark stat. The team's text-to-SQL feature was blocked by schema complexity: too many tables and columns to fit in a typical context window. Switching to Gemini 2.5 (1M tokens) allowed the full schema in context and pushed accuracy to ~100%. One infrastructure decision resolved months of prompt engineering attempts.
- 02 The industry has matured past the point where building your own tools is the default choice. In 2023, the argument for building custom LLM tooling was strong because commercial options were immature. By late 2025, Asaf's team had reversed that default. Commercial platforms now offer sufficient customization for most enterprise use cases at a fraction of the maintenance cost.
- 03 Sensitive data guardrails in financial services need to be in the platform, not the app. Application-level guardrails get bypassed, misconfigured, or simply forgotten as the codebase evolves. The more durable approach is enforcement at the AI infrastructure layer: before any application can call the model, the data sensitivity tier is checked and the context is filtered accordingly.
- 04 Scale challenges are real but solvable with the right architecture. Compute and storage costs for enterprise GenAI aren't trivial. Asaf's approach: instrument everything from the start, measure cost-per-experiment, and build a feedback loop between experiment outcomes and infrastructure investment decisions.
- 05 BI teams have the data intuition that GenAI teams need. One underappreciated resource: BI analysts who understand the business logic embedded in existing data models. When building text-to-SQL, those analysts were the most valuable collaborators: they knew which schema patterns were meaningful and which were legacy artifacts that would confuse the model.
- 06 Small bets compound into large transformations. None of Northwestern Mutual's individual experiments was a moonshot. But 12–18 months of bounded experiments, with the learnings applied systematically, produced a capability portfolio that would have been impossible to specify or fund as a single large initiative.
The maturity of the industry has changed. There's no reason to build the tool yourself anymore — the commercial options have caught up. Your edge is in how you apply the tools to your specific business context, not in the tools themselves.
From the Q&A
How do you handle the compute and storage cost at Fortune 100 scale?
Instrument every experiment with cost tracking from day one. Build a cost-per-outcome metric (cost per successful completion, cost per accurate answer) alongside your quality metrics. When you can show leadership the ROI in those terms, budget conversations become much easier. Asaf noted that teams that only track quality without tracking cost often build systems that can't be justified at scale.
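A cost-per-outcome metric of this kind can be computed from per-run records. The sketch below assumes each run logs its attributed cost and two flags; the record fields and metric names are hypothetical and would need to match whatever the team's instrumentation actually captures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    cost_usd: float   # compute and API spend attributed to this run
    completed: bool   # the run finished without error
    accurate: bool    # the answer passed the quality check


def cost_per_outcome(runs: list[RunRecord]) -> dict[str, float]:
    """Report cost metrics next to the quality metric for one experiment."""
    total = sum(r.cost_usd for r in runs)
    completed = sum(r.completed for r in runs)
    accurate = sum(r.completed and r.accurate for r in runs)
    return {
        "total_cost_usd": total,
        "cost_per_completion": total / completed if completed else float("inf"),
        "cost_per_accurate_answer": total / accurate if accurate else float("inf"),
        "accuracy": accurate / completed if completed else 0.0,
    }
```

Failed runs still count toward total cost, so an unreliable system shows up as an expensive one even when its successful answers look good.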
What pushed you from OpenAI to Gemini for the schema problem?
The schema for a large financial services company has hundreds of tables and thousands of columns. No matter how good the prompt engineering, the model couldn't reason accurately across the full schema when only a subset fit in context. Gemini 2.5's 1M-token context let the team put the entire schema in context, and the problem effectively solved itself. Daniel Vainsencher captured it well during the session: "The large context of Gemini 2.5 to the rescue?"
How do you prevent sensitive customer data from leaking through AI systems?
Northwestern Mutual uses a tiered data classification approach — each tier maps to explicit rules about which AI models can access it, whether data can be used for fine-tuning, and what output filtering applies. The enforcement is in the AI platform layer, not in individual applications. Every application that calls the AI platform inherits the guardrails automatically.
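Platform-layer enforcement of this kind might look like a single entry point that every application must call, with the tier policy applied before and after the model call. This is a sketch, not Northwestern Mutual's implementation: the tier names, model names, policy values, and the SSN-style redaction pattern are all hypothetical, and `backend` stands in for whatever client actually reaches the model.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class TierPolicy:
    allowed_models: frozenset[str]   # which models may receive this tier
    fine_tuning_allowed: bool        # whether the data may be used for tuning
    redact_output: bool              # whether output filtering applies


# Hypothetical policy table; every value here is illustrative.
POLICIES = {
    Tier.PUBLIC: TierPolicy(frozenset({"hosted-model", "internal-model"}), True, False),
    Tier.INTERNAL: TierPolicy(frozenset({"hosted-model", "internal-model"}), False, False),
    Tier.RESTRICTED: TierPolicy(frozenset({"internal-model"}), False, True),
}

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class PolicyViolation(Exception):
    """Raised when a request would send data to a model its tier forbids."""


def call_model(model: str, prompt: str, tier: Tier, backend) -> str:
    """Single entry point: applications never reach `backend` directly."""
    policy = POLICIES[tier]
    if model not in policy.allowed_models:
        raise PolicyViolation(f"{model} may not receive {tier.name} data")
    output = backend(model, prompt)
    if policy.redact_output:
        output = SSN_PATTERN.sub("[REDACTED]", output)
    return output
```

Because the check lives in `call_model` and not in each application, a new application gets the same rules without writing any guardrail code of its own.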
How do you get organizational buy-in for GenAI experiments when the outcomes are uncertain?
The small bets frame helps enormously. A 4-week experiment with a $50K budget and a specific hypothesis is a much easier sell than a 12-month transformation program. After enough successful small bets, the question shifts from "should we invest in AI?" to "which experiments should we prioritize next?" — and that's a much more productive conversation.