Firecrawl Agent Guide
The Firecrawl agent enables web scraping and data extraction from websites, turning unstructured web content into structured data for analysis.
Overview
- FirecrawlAgent - AI-powered web scraping and content extraction
FirecrawlAgent
The FirecrawlAgent understands:
- Web page structure and content
- Data extraction patterns
- Multi-page crawling
- Content cleaning and formatting
Basic Usage
from louieai.notebook import lui
# Simple page scraping
lui("Extract the main content from https://example.com/article", agent="FirecrawlAgent")
# Product information
lui("Scrape product details from this e-commerce page", agent="FirecrawlAgent")
# News aggregation
lui("Collect headlines from this news website", agent="FirecrawlAgent")
Content Extraction
# Article extraction
lui("""
Extract from this blog post:
- Title
- Author
- Publication date
- Main content
- Tags/categories
""", agent="FirecrawlAgent")
# Product catalog
lui("""
Scrape product information including:
- Product name
- Price
- Description
- Images
- Reviews
- Availability
""", agent="FirecrawlAgent")
# Contact information
lui("""
Find and extract:
- Company name
- Address
- Phone numbers
- Email addresses
- Social media links
""", agent="FirecrawlAgent")
Multi-page Crawling
# Paginated results
lui("""
Crawl all pages of search results, following pagination automatically, and extract:
- Item titles
- Prices
- Links
""", agent="FirecrawlAgent")
# Site navigation
lui("""
Starting from the homepage, crawl:
- All product categories
- Extract products from each category
- Limit to 100 pages total
""", agent="FirecrawlAgent")
# Documentation sites
lui("""
Scrape entire documentation site:
- Table of contents
- All documentation pages
- Code examples
- Maintain hierarchy
""", agent="FirecrawlAgent")
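The "Limit to 100 pages total" instruction above amounts to a bounded breadth-first traversal. The sketch below illustrates that idea only; it is not FirecrawlAgent's internal implementation. The `bounded_crawl` function and the in-memory `SITE` link map are hypothetical stand-ins for real page fetches.

```python
from collections import deque


def bounded_crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first from `start`, stopping after `max_pages` pages."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Queue each link once so cycles (e.g. links back home) are harmless.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site map standing in for real HTTP fetches.
SITE = {
    "/": ["/shoes", "/hats"],
    "/shoes": ["/shoes/1", "/shoes/2", "/"],
    "/hats": ["/hats/1"],
}

pages = bounded_crawl("/", lambda url: SITE.get(url, []), max_pages=4)
print(pages)
```

Breadth-first order means the page budget is spent on category pages before deep product pages, which is usually the desired behavior when a crawl is capped.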
Structured Data Extraction
# Table extraction
lui("""
Extract all tables from this page and convert to:
- Structured DataFrames
- Include headers
- Handle merged cells
- Clean formatting
""", agent="FirecrawlAgent")
# Form data
lui("""
Extract form fields and options:
- Input field names and types
- Dropdown options
- Default values
- Validation rules
""", agent="FirecrawlAgent")
# Metadata extraction
lui("""
Extract page metadata:
- Meta tags
- Open Graph data
- Schema.org markup
- JSON-LD data
""", agent="FirecrawlAgent")
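To make the metadata types above concrete, here is a minimal sketch that pulls Open Graph tags and JSON-LD blocks out of raw HTML using only the Python standard library. It is an illustration for checking results locally, not a description of how FirecrawlAgent extracts metadata; `MetadataParser` and `SAMPLE_HTML` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect Open Graph <meta> tags and JSON-LD <script> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.open_graph = {}
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.open_graph[attrs["property"]] = attrs.get("content")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD rather than fail the whole page.


SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example Article">
<script type="application/ld+json">
{"@type": "Article", "headline": "Example Article"}
</script>
</head><body></body></html>
"""

parser = MetadataParser()
parser.feed(SAMPLE_HTML)
print(parser.open_graph)
print(parser.items)
```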
Common Use Cases
Market Research
# Competitor analysis
lui("""
Analyze competitor website:
- Product offerings
- Pricing information
- Feature comparisons
- Customer reviews
""", agent="FirecrawlAgent")
# Price monitoring
lui("""
Track prices across multiple sites:
- Product name
- Current price
- Historical price if available
- Stock status
""", agent="FirecrawlAgent")
Content Aggregation
# News monitoring
lui("""
Aggregate news from multiple sources about:
- Specific keywords
- Company mentions
- Industry updates
- Publication dates
""", agent="FirecrawlAgent")
# Job listings
lui("""
Collect job postings matching:
- Job title
- Company
- Location
- Salary range
- Requirements
""", agent="FirecrawlAgent")
Data Collection
# Research data
lui("""
Extract research data:
- Statistical tables
- Chart data
- Citations
- Methodology sections
""", agent="FirecrawlAgent")
# Directory scraping
lui("""
Extract business listings:
- Business name
- Category
- Contact details
- Hours of operation
- Reviews/ratings
""", agent="FirecrawlAgent")
Advanced Features
Dynamic Content
# JavaScript-rendered content
lui("""
Extract content from this React/Vue/Angular app:
- Wait for dynamic loading
- Capture AJAX-loaded data
- Handle infinite scroll
""", agent="FirecrawlAgent")
# Interactive elements
lui("""
Extract data that requires interaction:
- Click tabs to reveal content
- Expand accordions
- Click "Load more" buttons
""", agent="FirecrawlAgent")
Data Cleaning
# Clean extracted data
lui("""
Extract and clean:
- Remove ads and popups
- Strip formatting tags
- Normalize whitespace
- Convert to plain text
""", agent="FirecrawlAgent")
# Format conversion
lui("""
Extract content and convert to:
- Markdown format
- Clean HTML
- Structured JSON
- CSV for tables
""", agent="FirecrawlAgent")
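The cleaning steps listed in this section (strip tags, normalize whitespace, convert to plain text) can also be reproduced locally when you want to post-process raw HTML yourself. The following is a rough standard-library sketch under that assumption; `TextExtractor` and `html_to_text` are hypothetical helpers, not part of the Louie or Firecrawl APIs.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags, then collapse every run of whitespace to a single space."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Joining with a space separates block elements; it also splits words
    # that are broken up by inline tags, which is acceptable for a sketch.
    return re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip()


print(html_to_text("<p>Hello   <b>world</b></p>\n<script>track()</script><p>Bye</p>"))
```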
Filtering and Selection
# Selective extraction
lui("""
Extract only:
- Main article content
- Skip navigation and sidebars
- Ignore advertisements
- Focus on relevant sections
""", agent="FirecrawlAgent")
# Pattern matching
lui("""
Find and extract all:
- Email addresses
- Phone numbers
- Prices with currency
- Dates in any format
""", agent="FirecrawlAgent")
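For spot-checking pattern-matching results, a few regular expressions go a long way. The patterns below are deliberately loose illustrations (they cover only common email shapes, three currency symbols, and ISO-style dates), and `find_patterns` is a hypothetical helper, not something the agent exposes.

```python
import re

# Intentionally simple patterns; real-world data needs stricter validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PRICE_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?")
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_patterns(text):
    """Return every email, currency-prefixed price, and ISO date found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "prices": PRICE_RE.findall(text),
        "iso_dates": ISO_DATE_RE.findall(text),
    }


sample = "Contact sales@example.com by 2024-03-01. Plans start at $1,299.00 or €49."
print(find_patterns(sample))
```

"Dates in any format" is much harder than the ISO case shown here; a dedicated date-parsing library is usually a better fit than hand-written patterns.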
Best Practices
Respectful Scraping
# Rate limiting
lui("""
Scrape this site respectfully:
- Maximum 1 request per second
- Respect robots.txt
- Use appropriate user agent
- Handle rate limit responses
""", agent="FirecrawlAgent")
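If you drive requests yourself, the "maximum 1 request per second" and "handle rate limit responses" items above can be sketched as a small limiter plus a backoff rule. This is an assumption-laden illustration: `RateLimiter` and `retry_delay` are hypothetical helpers, and whether FirecrawlAgent throttles this way internally is not stated in this guide.

```python
import time


class RateLimiter:
    """Block so that consecutive calls to wait() are at least `min_interval` apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after a rate-limit response such as HTTP 429.

    Prefer the server's Retry-After value (in seconds) when it is provided;
    otherwise back off exponentially, never exceeding `cap`.
    """
    if retry_after is not None:
        return min(float(retry_after), cap)
    return min(base * 2 ** attempt, cap)


limiter = RateLimiter(min_interval=1.0)
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()  # The second iteration pauses for about one second.
    print("would fetch", url)
```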
Error Handling
# Robust extraction
lui("""
Extract data with fallbacks:
- Primary selectors
- Alternative selectors
- Default values for missing data
- Error reporting
""", agent="FirecrawlAgent")
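The fallback strategy above (primary source, alternatives, default value, error report) can be expressed compactly when working with already-extracted records. The sketch below assumes each record is a plain dict; the field names in `FIELD_SOURCES` and both helper functions are hypothetical.

```python
def extract_field(record, candidates, default=None):
    """Return (value, source_key) for the first non-empty candidate key."""
    for key in candidates:
        value = record.get(key)
        if value not in (None, ""):
            return value, key
    return default, None


# Hypothetical mapping: output field -> source keys, in order of preference.
FIELD_SOURCES = {
    "price": ["price", "sale_price", "list_price"],
    "title": ["title", "og_title", "h1"],
}


def normalize(record, default="N/A"):
    """Apply the fallbacks and report which fields fell through to the default."""
    result, missing = {}, []
    for field, candidates in FIELD_SOURCES.items():
        value, source = extract_field(record, candidates, default=default)
        result[field] = value
        if source is None:
            missing.append(field)
    return result, missing


row, missing = normalize({"sale_price": "19.99", "title": ""})
print(row, missing)
```

Returning the list of missing fields alongside the row keeps the error report next to the data it describes, so downstream steps can decide whether a defaulted value is acceptable.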
Data Quality
# Validation
lui("""
Extract and validate:
- Check data completeness
- Verify expected formats
- Flag anomalies
- Report extraction confidence
""", agent="FirecrawlAgent")
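A local check along the same lines can catch incomplete or oddly formatted rows before they reach other agents. This is a minimal sketch assuming rows are dicts with string values; the required fields and the price format are hypothetical choices for illustration.

```python
import re

REQUIRED = ("name", "price")  # Hypothetical required fields.
PRICE_FORMAT = re.compile(r"^\d+(\.\d{2})?$")  # e.g. "12" or "19.99"


def validate(rows):
    """Return (row_index, problems) for every row that fails a check."""
    report = []
    for i, row in enumerate(rows):
        problems = [f"missing {field}" for field in REQUIRED if not row.get(field)]
        price = row.get("price")
        if price and not PRICE_FORMAT.match(price):
            problems.append(f"unexpected price format: {price!r}")
        if problems:
            report.append((i, problems))
    return report


rows = [
    {"name": "Widget", "price": "19.99"},
    {"name": "", "price": "12"},
    {"name": "Gadget", "price": "$5"},
]
issues = validate(rows)
print(issues)
```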
Integration with Other Agents
# Scrape data
lui("Extract product catalog from competitor site", agent="FirecrawlAgent")
scraped_data = lui.df
# Store in a SQL database
lui("Store this scraped data in our database", agent="PostgresAgent")
# Create visualizations
lui("Visualize price comparisons across competitors", agent="PerspectiveAgent")
# Generate insights
lui("Analyze competitive positioning based on this data", agent="LouieAgent")
Output Formats
Structured Data
# DataFrame output
lui("""
Extract as DataFrame:
- Column headers from page
- Consistent data types
- Handle missing values
- Index by unique identifier
""", agent="FirecrawlAgent")
Nested Data
# Hierarchical extraction
lui("""
Extract nested structure:
- Categories
- Subcategories
- Products
- Attributes
Maintain the parent-child relationships between levels
""", agent="FirecrawlAgent")
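If the result comes back as flat rows, the category, subcategory, product hierarchy can be rebuilt locally. The sketch below assumes hypothetical column names (`category`, `subcategory`, `product`, `attributes`); adjust them to whatever the extraction actually returns.

```python
def nest(rows):
    """Group flat rows into {category: {subcategory: [product, ...]}}."""
    tree = {}
    for row in rows:
        products = tree.setdefault(row["category"], {}).setdefault(row["subcategory"], [])
        products.append({"name": row["product"], "attributes": row["attributes"]})
    return tree


flat_rows = [
    {"category": "Shoes", "subcategory": "Running", "product": "Sprint", "attributes": {"size": 42}},
    {"category": "Shoes", "subcategory": "Running", "product": "Dash", "attributes": {"size": 44}},
    {"category": "Hats", "subcategory": "Caps", "product": "Classic", "attributes": {"color": "red"}},
]

tree = nest(flat_rows)
print(tree)
```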
Raw Content
# Full page capture
lui("""
Capture entire page:
- HTML source
- Rendered content
- Resources (images, CSS)
- Screenshot
""", agent="FirecrawlAgent")
Legal and Ethical Considerations
Important: Always ensure you have permission to scrape websites and comply with:
- Website terms of service
- Robots.txt directives
- Rate limiting requirements
- Copyright laws
- Data protection regulations (GDPR, CCPA)
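Of the items above, robots.txt compliance is the easiest to check programmatically. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt so that it runs without network access; the `example-bot` user agent and the `allowed_urls` helper are hypothetical. It does not replace reviewing a site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Inline rules for illustration; in practice, call set_url() and read()
# to load the site's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, user_agent="example-bot"):
    """Keep only the URLs that the parsed robots.txt allows for `user_agent`."""
    return [url for url in urls if robots.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/blog",
    "https://example.com/private/data",
]
print(allowed_urls(candidates))
```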
Next Steps
- Learn about TableAI Agent for analyzing extracted tables
- Explore Code Agent for processing scraped data
- See OpenSearch Agent for indexing web content
- Check the Query Patterns Guide for more examples