Generate sitemaps and robots.txt

Generate XML sitemaps and a robots.txt file, hooked into Parklife's build process. After all HTML pages are built, the sitemap generator discovers them, assigns priorities, and provides accurate last modification dates using Git timestamps.

Add Parklife after build hook

Parkfile
...

Parklife.application.after_build do |application|
  sitemap = Sitemap.new(
    base_url: application.config.base,
    build_dir: application.config.build_dir,
    generate_robots: true
  )

  # Use `next` rather than `return` so a failed validation only skips this block.
  next Rails.logger.error("Error generating sitemap: #{sitemap.errors.full_messages.join(', ')}") unless sitemap.valid?

  sitemap.generate!
end

...

Create the sitemap models

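A minimal sketch of what such a model could look like, assuming it lives in app/models/sitemap.rb and uses ActiveModel for the validations the Parkfile hook relies on (valid? and errors.full_messages). The attribute names match the hook above; everything else is illustrative.

app/models/sitemap.rb
class Sitemap
  include ActiveModel::Model

  attr_accessor :base_url, :build_dir, :generate_robots

  validates :base_url, :build_dir, presence: true

  # Write the sitemap files (and optionally robots.txt) into build_dir.
  def generate!
    write_sitemap
    write_robots if generate_robots
  end

  private

  def write_sitemap
    # Discover the built HTML pages, build the XML, write sitemap.xml and sitemap.xml.gz.
  end

  def write_robots
    # Write a robots.txt pointing crawlers at both sitemap URLs.
  end
end

The private methods are stubs here; the Implementation Details section below describes what they need to do.
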
Remove default robots.txt

Delete the default robots.txt file from the public/ folder.

[temporary] Install Parklife edge version

At the time of writing this guide, the build hooks used above are not yet included in the latest Parklife release.

Gemfile
...
# TODO: 30jul25 - switch back to gem release once PR #124 is included (re-introduces build callbacks)
#       https://github.com/benpickles/parklife/pull/124
gem "parklife", github: "benpickles/parklife"
...
bundle

Implementation Details

Generated Files

The system generates three files in your build directory:

  sitemap.xml - the uncompressed XML sitemap
  sitemap.xml.gz - a gzip-compressed copy of the same sitemap
  robots.txt - points crawlers at both sitemap URLs

Sitemap Features

Automatic Discovery

The sitemap automatically discovers all HTML files in the build directory, excluding error pages (404.html, 500.html, etc.).

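A minimal sketch of that discovery step, assuming a helper on the Sitemap model (the ERROR_PAGES contents shown are an assumption; the guide's constant may list different pages):

# Hypothetical discovery helper on the Sitemap model.
ERROR_PAGES = %w[404.html 422.html 500.html].freeze

def html_pages
  Dir.glob(File.join(build_dir, "**", "*.html"))
     .reject { |path| ERROR_PAGES.include?(File.basename(path)) }
     .sort
end
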
Priorities

Pages are assigned priorities based on their URL structure; a sketch of one possible scheme follows.

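As an illustration only (the specific values and rules below are assumptions, not the guide's actual mapping), a depth-based scheme could look like this:

# Illustrative priority scheme: shallower URLs get higher priority.
def priority_for(url_path)
  case url_path.count("/")
  when 0, 1 then 1.0 # "/" and top-level pages
  when 2    then 0.8 # e.g. "/blog/my-post"
  else 0.5           # deeper pages
  end
end
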
Last Modification Dates

To provide accurate last modification dates, the system attempts to find the original source file for each HTML page. Source files are discovered using the CONTENT_PATTERNS constant; an illustrative sketch of such patterns follows.

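As a hedged example of what those patterns could look like (the globs and lookup helper below are assumptions, not the guide's actual constant):

# Hypothetical source-file patterns and lookup helper.
CONTENT_PATTERNS = [
  "app/views/pages/%{name}.html.erb",
  "app/content/%{name}.md"
].freeze

def source_file_for(page_name)
  CONTENT_PATTERNS
    .map { |pattern| format(pattern, name: page_name) }
    .find { |candidate| File.exist?(candidate) }
end
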
The timestamp strategy depends on whether a source file is found (a sketch follows the list):

  1. Git history - Last commit date for the source file
  2. File system - File modification time if the file isn’t tracked in Git
  3. Current time - Used when the source file was not found

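A minimal sketch of that fallback chain (the helper name and git invocation are illustrative):

# Illustrative fallback chain for a page's <lastmod> value.
def last_modified_for(source_file)
  return Time.current if source_file.nil?                  # 3. source not found

  git_date = `git log -1 --format=%cI -- #{source_file}`.strip
  return Time.zone.parse(git_date) unless git_date.empty?  # 1. last commit date

  File.mtime(source_file)                                  # 2. untracked file
end
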
Change Frequency

All pages are marked with a “monthly” change frequency by default.

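Putting those pieces together, each discovered page becomes one <url> entry in sitemap.xml; the values below are illustrative:

<url>
  <loc>https://example.com/blog/my-post</loc>
  <lastmod>2025-07-30T12:00:00+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
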
Model Architecture

The implementation follows Rails conventions, with a main Sitemap model and supporting classes; a sketch of one possible supporting class follows.

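As a hedged illustration of one possible supporting class (the name and shape are assumptions), a small value object could carry the per-page data the XML builder needs:

class Sitemap
  # Hypothetical value object describing a single sitemap entry.
  Page = Struct.new(:loc, :lastmod, :changefreq, :priority, keyword_init: true)
end

# e.g. Sitemap::Page.new(loc: "https://example.com/about",
#                        lastmod: Time.current, changefreq: "monthly", priority: 0.8)
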
Robots.txt Content

The generated robots.txt file includes:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml.gz
Sitemap: https://example.com/sitemap.xml

Both compressed and uncompressed sitemap URLs are included for maximum compatibility.

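A minimal sketch of how the model could write that file, assuming a write_robots helper like the one stubbed in the model sketch above; the content mirrors the example:

# Illustrative robots.txt writer; the method name and layout are assumptions.
def write_robots
  File.write(File.join(build_dir, "robots.txt"), <<~ROBOTS)
    User-agent: *
    Allow: /

    Sitemap: #{base_url}/sitemap.xml.gz
    Sitemap: #{base_url}/sitemap.xml
  ROBOTS
end
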
Configuration

The sitemap generator requires:

  base_url - the canonical site URL (application.config.base in the Parkfile hook)
  build_dir - the directory containing the built HTML files (application.config.build_dir)

Passing generate_robots: true also writes a robots.txt file alongside the sitemap.

Error pages defined in the ERROR_PAGES constant are automatically excluded from the sitemap.


Commit: Sitemap and Robots


What next?

Optimize metadata for crawlers