Ansible Fact Caching: The --limit Problem and Environment Separation Pain Points

Ansible fact caching promises performance improvements and cross-playbook fact persistence, but it comes with frustrating limitations that have plagued operations teams for years. The memory cache's incompatibility with --limit runs and the absence of dynamic cache-location configuration create operational complexity with no elegant solution.

The Memory Cache --limit Catastrophe

Ansible's memory cache plugin is the default fact caching mechanism, storing facts only for the current playbook execution. This creates a fundamental incompatibility with targeted deployments using the --limit flag.
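
Spelled out explicitly in ansible.cfg, that default looks like this (a minimal sketch; you normally never need to write it, since memory is the implicit default):

[defaults]
# The implicit default: facts are cached in memory and discarded
# when the ansible-playbook process exits
fact_caching = memory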

The Core Problem

When using memory caching with --limit, Ansible only gathers facts for hosts within the limit scope. Any playbook tasks that reference hostvars for hosts outside the limit will fail catastrophically:

# Example demonstrating the --limit problem with memory caching
# This playbook will fail when run with --limit if dependent facts are needed

---
- name: Deploy application servers
  hosts: app_servers
  gather_facts: yes
  tasks:
    - name: Configure application
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
      vars:
        # This will fail with --limit if db_servers facts aren't cached
        db_primary_ip: "{{ hostvars[groups['db_servers'][0]]['ansible_default_ipv4']['address'] }}"
        
- name: Update load balancer
  hosts: lb_servers
  gather_facts: yes
  tasks:
    - name: Update backend pool
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      vars:
        # This fails when the limit excludes the app servers (e.g. --limit lb01),
        # because their facts are never gathered
        backend_servers: |
          {% for host in groups['app_servers'] %}
          server {{ hostvars[host]['ansible_default_ipv4']['address'] }}:8080;
          {% endfor %}

Running this playbook with --limit app01 fails because facts for the database servers are never gathered, so hostvars[groups['db_servers'][0]]['ansible_default_ipv4'] is undefined and the template task errors out.

The Devastating Impact

This limitation makes memory caching incompatible with common operational patterns:

  • Rolling deployments: Cannot deploy one server at a time when templates reference other servers
  • Targeted maintenance: Emergency fixes to single hosts fail when they depend on cluster facts
  • Load balancer updates: Cannot update one load balancer with backend pool information
  • Cross-service coordination: Microservice deployments break when services reference each other

In practice the failure looks like this:

#!/bin/bash

# Demonstrates the --limit problem with memory caching: facts for hosts
# outside the limit are never gathered, so templates that reference them
# fail to render

echo "=== Running with --limit (this will fail) ==="
if ! ansible-playbook -i inventory/production deploy.yml --limit app01; then
    echo ""
    echo "Error: hostvars for hosts outside the --limit scope are unavailable"
    echo "because the memory cache only holds facts for the gathered hosts"
fi

File-Based Caching: Trading One Problem for Another

The obvious solution is switching to persistent cache plugins like jsonfile or Redis. This solves the --limit problem but introduces equally frustrating environment separation issues.
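
Enabling the jsonfile backend is a three-line change in ansible.cfg (a minimal sketch; the path and timeout are illustrative):

[defaults]
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible-facts
fact_caching_timeout = 86400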

The Environment Isolation Problem

Multi-environment infrastructures need isolated fact caches to prevent cross-contamination between development, staging, and production environments. However, Ansible provides no mechanism to dynamically configure cache locations.

The fact_caching_connection parameter is read once, at startup, from ansible.cfg. It cannot be changed while a playbook runs, so a single shared configuration cannot point different environments at different caches:

---
# This DOESN'T WORK - you cannot dynamically set fact cache location
# Demonstrating what many people try but fails

- name: Attempt to set dynamic cache path
  hosts: localhost
  gather_facts: no
  vars:
    # "environment" is a reserved variable name in Ansible, so use env_name
    env_name: "{{ lookup('env', 'ENVIRONMENT') | default('development', true) }}"
    cache_path: "/tmp/ansible-facts-{{ env_name }}"
  tasks:
    # This has no effect - fact_caching_connection is read only at startup,
    # so set_fact merely creates an ordinary variable with that name
    - name: Try to set cache path dynamically
      set_fact:
        fact_caching_connection: "{{ cache_path }}"
      
    - name: Show the harsh reality
      debug:
        msg: |
          REALITY CHECK:
          - fact_caching_connection cannot be changed at runtime
          - It's read from ansible.cfg at startup only
          - Setting environment variables from inside the play won't help either
          - You're stuck with separate config files

The Only Working Solutions: Operational Workarounds

After years of this limitation, operations teams have developed several workarounds, none of which are elegant or maintainable at scale.

Workaround 1: Pre-populate Cache Strategy

The most reliable approach is running a dedicated fact-gathering playbook before any --limit operations:

---
# Dedicated playbook for pre-populating fact cache
# Run this before using --limit to ensure all facts are cached

- name: Gather facts for all hosts
  hosts: all
  gather_facts: yes
  tasks:
    - name: Display gathered fact count
      debug:
        msg: "Gathered {{ ansible_facts | length }} facts for {{ inventory_hostname }}"
        
    - name: Show cache status
      debug:
        msg: "Facts cached for later --limit operations"

This requires a two-step process for every targeted deployment:

#!/bin/bash

# Workaround: Pre-populate fact cache before running with --limit

echo "=== Step 1: Gather facts for all hosts (populates cache) ==="
ansible-playbook -i inventory/production gather-facts.yml

echo ""
echo "=== Step 2: Run deployment with --limit (now works with cached facts) ==="
ansible-playbook -i inventory/production deploy.yml --limit app01

echo ""
echo "Success: Cached facts are available for all hosts even with --limit"

Drawbacks of Cache Pre-population

  • Performance penalty: Must gather facts for all hosts even for small changes
  • Stale data risk: Cache might contain outdated information for non-targeted hosts (see the refresh sketch after this list)
  • Operational complexity: Every deployment becomes a multi-step process
  • Emergency response impact: Critical fixes require full fact gathering first
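
One way to blunt the stale-data risk is to rebuild the cache right before the pre-population run; ansible-playbook's --flush-cache option clears the cached facts for every host in the inventory so the gather step repopulates them from scratch. A sketch, reusing the gather-facts.yml playbook above:

#!/bin/bash

# Rebuild the fact cache from scratch, then run the targeted deployment
ansible-playbook -i inventory/production gather-facts.yml --flush-cache
ansible-playbook -i inventory/production deploy.yml --limit app01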

Workaround 2: Environment-Specific Configuration Files

For environment separation, the only solution is maintaining separate ansible.cfg files with different cache locations:

Environment    Configuration File      Cache Location
Development    ansible-dev.cfg         /tmp/ansible-facts-dev
Staging        ansible-staging.cfg     /tmp/ansible-facts-staging
Production     ansible-prod.cfg        /tmp/ansible-facts-prod

Development configuration example:

[defaults]
inventory = inventory/development
host_key_checking = False
gathering = smart

# Development environment fact caching
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible-facts-dev
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Production configuration example:

[defaults]
inventory = inventory/production
host_key_checking = False
gathering = smart

# Production environment fact caching
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible-facts-prod
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Environment-Specific Execution Script

Most teams end up wrapping ansible-playbook in environment-aware scripts:

#!/bin/bash

# Environment-specific Ansible execution script
# This is the only practical workaround for environment-specific fact caching

ENVIRONMENT=${1:-development}
# Drop the environment argument; everything else is passed to ansible-playbook
if [ "$#" -gt 0 ]; then
    shift
fi

case $ENVIRONMENT in
    "development")
        ANSIBLE_CONFIG="ansible-dev.cfg"
        ;;
    "staging")
        ANSIBLE_CONFIG="ansible-staging.cfg"
        ;;
    "production")
        ANSIBLE_CONFIG="ansible-prod.cfg"
        ;;
    *)
        echo "Error: Unknown environment '$ENVIRONMENT'"
        echo "Usage: $0 [development|staging|production] [ansible-playbook arguments]"
        exit 1
        ;;
esac

echo "Using configuration: $ANSIBLE_CONFIG"
echo "Fact cache will be environment-specific"

# Export the config and run ansible-playbook
export ANSIBLE_CONFIG
ansible-playbook -i "inventory/$ENVIRONMENT" "$@"

Configuration Maintenance Nightmare

  • Configuration drift: Multiple files inevitably diverge over time
  • Documentation burden: Teams must document which config to use when
  • Error-prone operations: Easy to use wrong configuration file
  • Onboarding complexity: New team members struggle with multiple configs

Alternative Cache Plugins: Same Problems, Different Complexity

Redis and other persistent cache plugins solve the --limit problem but don't address environment separation:

[defaults]
inventory = inventory
host_key_checking = False
gathering = smart

# Redis-based fact caching (still doesn't solve environment separation)
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_timeout = 86400

# Note: every environment pointed at this file shares the same Redis
# database and keyspace, which can lead to cross-environment contamination.
# Separating them still means separate Redis databases/instances or
# per-environment config files with different fact_caching_prefix values.

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Redis Cache Limitations

  • Static key prefixing: fact_caching_prefix is fixed per configuration file, so separating environments in a single Redis instance still requires per-environment configs
  • Infrastructure dependency: Requires Redis server management
  • Network complexity: Another service to secure and monitor
  • Cross-environment contamination: All environments share same keyspace

Memory Usage Concerns

Recent AWX issue reports highlight memory consumption problems with fact caching in large inventories. Each job can consume 1.7GB+ of memory when caching facts for 1700+ hosts, leading to controller OOM conditions.

The Real-World Impact

These limitations create operational friction that affects entire organizations:

DevOps Team Frustration

  • Deployment delays: Simple changes require complex pre-steps
  • Emergency response problems: Critical fixes can't be deployed quickly
  • Tool complexity: Wrapper scripts and documentation overhead
  • Training burden: New team members need extensive onboarding

Architectural Compromises

Teams often architect around Ansible's limitations rather than around the infrastructure design they actually want:

  • Avoiding cross-references: Designing services to not reference each other
  • Static configurations: Using hardcoded values instead of dynamic discovery
  • Monolithic playbooks: Avoiding modular designs that would require --limit
  • External coordination: Using other tools for tasks Ansible should handle

What Ansible Should Provide (But Doesn't)

The Ansible community has requested these features for years, but they remain unimplemented:

Dynamic Cache Configuration

The ability to set cache locations dynamically would solve the environment separation problem:

# This should work but doesn't
---
- name: Set environment-specific cache
  set_fact:
    fact_caching_connection: "/tmp/facts-{{ ansible_environment }}"
    cacheable: yes

Environment Variables for Cache Paths

Environment variable support for all cache plugin parameters would enable flexible deployments:

# This should work but doesn't
export ANSIBLE_FACT_CACHE_CONNECTION="/tmp/facts-${ENVIRONMENT}"
ansible-playbook deploy.yml

Cache Key Prefixing

Variable expansion for the cache key prefix would enable environment separation with a single shared backend configuration:

# ansible.cfg values are not expanded from the environment, so this doesn't work
[defaults]
fact_caching = redis
fact_caching_connection = localhost:6379:0
fact_caching_prefix = "${ENVIRONMENT}"

Performance and Scalability Considerations

Beyond functionality issues, fact caching introduces performance considerations that operations teams must carefully manage:

Memory Consumption Patterns

  • Large inventories: Memory usage scales linearly with host count
  • Rich fact sets: Modern systems generate extensive fact data (see the trimming sketch after this list)
  • Controller limits: AWX/Tower controllers can hit memory limits
  • Concurrent jobs: Multiple playbooks multiply memory usage
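
Fact payload size is partly controllable at the play level: the gather_subset keyword can skip the heaviest collectors. A minimal sketch (which subsets to exclude is workload-specific):

---
- name: Gather a slimmer fact set to reduce cache and memory footprint
  hosts: all
  gather_facts: yes
  # Skip the most expensive collectors; minimal and network facts remain
  gather_subset:
    - "!hardware"
    - "!facter"
    - "!ohai"
  tasks: []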

Cache Timeout Management

Cache timeout configuration requires balancing performance with data freshness; a per-environment sketch follows the list below:

  • Short timeouts: Frequent fact gathering negates performance benefits
  • Long timeouts: Stale data leads to deployment inconsistencies
  • Environment differences: Production needs longer caches than development
  • Cache invalidation: No mechanism for selective cache clearing
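
In practice the timeout becomes yet another per-file setting, with each environment's ansible.cfg carrying its own value (the numbers of seconds below are illustrative):

# ansible-dev.cfg: short timeout, facts on dev hosts change frequently
fact_caching_timeout = 3600

# ansible-prod.cfg: long timeout, refreshed by a scheduled cache-warming run
fact_caching_timeout = 86400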

Best Practices for Working Around the Pain

Until Ansible addresses these fundamental limitations, operations teams can minimize the pain with disciplined practices:

Operational Discipline

  • Standardize scripts: Always use wrapper scripts for environment selection
  • Document extensively: Clear procedures for cache management
  • Automate cache warming: Cron jobs to pre-populate caches (see the cron sketch after this list)
  • Monitor cache health: Alerts for cache staleness and size
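
A typical cache-warming entry looks like this (a sketch; the repository path and log file are illustrative, while the playbook and config names come from the examples above):

# Re-gather facts for all production hosts nightly at 02:00 so that
# daytime --limit runs hit a warm, reasonably fresh cache
0 2 * * * cd /opt/ansible && ANSIBLE_CONFIG=ansible-prod.cfg ansible-playbook -i inventory/production gather-facts.yml >> /var/log/ansible-cache-warm.log 2>&1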

Architecture Patterns

  • Minimize cross-references: Reduce dependencies between host groups
  • External discovery: Use Consul or similar for service discovery
  • Template pre-processing: Generate configurations outside Ansible
  • Incremental deployments: Design for full-environment updates

Monitoring and Alerting

  • Cache size monitoring: Track memory and disk usage
  • Fact freshness checks: Verify cache timestamps (see the sketch after this list)
  • Failed deployment alerts: Quick detection of cache-related failures
  • Performance tracking: Monitor fact gathering times
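
For the jsonfile backend, which writes one JSON file per host into the configured directory, a freshness check can be as simple as inspecting file modification times (a sketch; the path and threshold are illustrative):

#!/bin/bash

# Warn if any cached fact file is older than 24 hours (1440 minutes)
CACHE_DIR="/tmp/ansible-facts-prod"
STALE=$(find "$CACHE_DIR" -maxdepth 1 -type f -mmin +1440 | wc -l)

if [ "$STALE" -gt 0 ]; then
    echo "WARNING: $STALE cached fact file(s) in $CACHE_DIR are older than 24h"
    exit 1
fi
echo "OK: fact cache in $CACHE_DIR is fresh"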

Alternative Tools and Migration Strategies

Some organizations eventually abandon Ansible fact caching entirely, migrating to tools with better architectural support for these use cases:

External Fact Management

  • HashiCorp Consul: Service discovery with environment isolation
  • etcd: Distributed key-value store with namespace support
  • HashiCorp Vault: Secrets and configuration management
  • Custom APIs: Application-specific configuration services

Configuration Management Alternatives

  • Terraform: Infrastructure as code with better state management
  • Pulumi: Modern infrastructure as code with programming languages
  • Kubernetes: Container orchestration with built-in service discovery
  • HashiCorp Nomad: Workload orchestration with service mesh

The Path Forward: Community and Vendor Response

This pain has persisted for years despite extensive community discussion. The Ansible project acknowledges these limitations but provides no roadmap for resolution.

Community Workarounds

The community has developed numerous workarounds, but they remain fragmented and organization-specific. Popular approaches include:

  • Custom cache plugins: Organization-specific solutions
  • Wrapper tooling: Scripts and frameworks around Ansible
  • Hybrid architectures: Combining Ansible with other tools
  • Process changes: Adapting workflows to tool limitations

Vendor Solutions

Red Hat's Ansible Automation Platform provides some improvements through Automation Controller (formerly AWX/Tower), but the core fact caching limitations remain.

Conclusion: Living with the Pain

Ansible fact caching is one of those infrastructure features that promise an elegant solution but deliver operational complexity. The fundamental limitations around --limit operations and environment separation have no clean fixes, forcing operations teams into elaborate workarounds.

The memory cache --limit incompatibility makes the default configuration unsuitable for production operations, while persistent caching requires complex configuration management to achieve environment separation. After years of community requests, these problems remain unaddressed.

Organizations serious about infrastructure automation eventually develop patterns that work around these limitations or migrate to tools with better architectural support for multi-environment operations. The key is recognizing these limitations early and designing operational processes that account for them rather than fighting against the tool's constraints.

Until Ansible provides dynamic cache configuration and proper environment isolation, operations teams must choose between operational complexity and architectural compromises. Neither choice is ideal, but understanding the tradeoffs enables informed decisions about tooling and process design.