A Rust implementation of a LinkedIn scraper that can extract company profiles, job listings, and people profiles.
- Company Profile Spider: Scrapes company information including name, summary, industry, size, and founding date
- Jobs Spider: Scrapes job listings with pagination support
- People Profile Spider: Scrapes people profiles including experience and education
- Concurrent Processing: Configurable concurrent request handling
- HTTP Client: Built-in retry mechanisms and rate-limit handling
- JSON Output: Saves data in JSON format with timestamps
- Configurable Timeouts: Customizable request timeouts and retry settings
- Make sure you have Rust installed (https://rustup.rs/)
- Clone this repository
- Build the project:
cargo build --release
The scraper provides three main commands:
# Scrape specific company profiles
cargo run -- company-profile --urls "https://www.linkedin.com/company/microsoft" --urls "https://www.linkedin.com/company/google"
# Scrape job listings
cargo run -- jobs --keywords "rust developer" --location "San Francisco"
# Scrape people profiles
cargo run -- people-profile --profiles "danielefalchetti"
Command-line options:
- -c, --concurrent <N>: Number of concurrent requests (default: 1)
- -o, --output <PATH>: Output directory for JSON files (default: "data")
- --timeout <SECONDS>: Request timeout in seconds (default: 30)
- --retries <N>: Maximum number of retries for failed requests (default: 3)
- --keywords <KEYWORDS>: Search keywords (jobs command)
- --location <LOCATION>: Job location (jobs command)
- --urls <URL>: Company profile URLs (company-profile command; can be specified multiple times)
- --profiles <PROFILE>: LinkedIn profile usernames (people-profile command; can be specified multiple times)
You can set configuration via environment variables:
- CONCURRENT_REQUESTS: Number of concurrent requests
- REQUEST_TIMEOUT: Request timeout in seconds
- MAX_RETRIES: Maximum number of retries for failed requests
- RETRY_DELAY_MS: Delay between retries in milliseconds
- USER_AGENT: Custom user agent string
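A minimal sketch of how these variables might be read, falling back to the documented defaults; the env_or helper, the RETRY_DELAY_MS default, and the fallback user-agent string are illustrative assumptions, not the crate's actual code:

```rust
use std::env;
use std::time::Duration;

/// Read an environment variable and parse it, falling back to `default`
/// when it is unset or unparsable. Illustrative helper, not the crate's API.
fn env_or<T: std::str::FromStr>(key: &str, default: T) -> T {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let concurrent: usize = env_or("CONCURRENT_REQUESTS", 1);                 // default from this README
    let timeout = Duration::from_secs(env_or("REQUEST_TIMEOUT", 30));         // default from this README
    let max_retries: u32 = env_or("MAX_RETRIES", 3);                          // default from this README
    let retry_delay = Duration::from_millis(env_or("RETRY_DELAY_MS", 1_000)); // assumed default
    let user_agent = env::var("USER_AGENT")
        .unwrap_or_else(|_| "linkedin-scraper/0.1".to_string());              // assumed fallback UA

    println!(
        "{concurrent} concurrent, timeout {timeout:?}, {max_retries} retries, \
         retry delay {retry_delay:?}, user agent {user_agent}"
    );
}
```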
The scraper follows a modular architecture:
- Spiders: Define scraping logic for each data type
- HTTP Client: Handles requests with retry mechanisms and rate limiting
- Pipeline: Processes and saves scraped items
- Middleware: Extensible request/response processing
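To illustrate how the spider and pipeline pieces fit together, here is a minimal, self-contained sketch; the Spider and Pipeline trait names and the toy title extraction are assumptions for illustration, not the crate's actual API:

```rust
use std::error::Error;

/// An item produced by a spider; here just a labelled string for simplicity.
struct ScrapedItem {
    kind: &'static str,
    body: String,
}

/// A spider turns a fetched page body into zero or more items.
trait Spider {
    fn name(&self) -> &'static str;
    fn parse(&self, page_body: &str) -> Vec<ScrapedItem>;
}

/// A pipeline processes and persists items emitted by spiders.
trait Pipeline {
    fn process(&mut self, item: ScrapedItem) -> Result<(), Box<dyn Error>>;
}

/// Toy spider: extracts the <title> text as a stand-in for real parsing.
struct CompanyProfileSpider;

impl Spider for CompanyProfileSpider {
    fn name(&self) -> &'static str {
        "company-profile"
    }

    fn parse(&self, page_body: &str) -> Vec<ScrapedItem> {
        let title = page_body
            .split("<title>")
            .nth(1)
            .and_then(|rest| rest.split("</title>").next())
            .unwrap_or_default();
        vec![ScrapedItem { kind: "company", body: title.to_string() }]
    }
}

/// Toy pipeline: prints items instead of writing timestamped JSON files.
struct StdoutPipeline;

impl Pipeline for StdoutPipeline {
    fn process(&mut self, item: ScrapedItem) -> Result<(), Box<dyn Error>> {
        println!("[{}] {}", item.kind, item.body);
        Ok(())
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let spider = CompanyProfileSpider;
    let mut pipeline = StdoutPipeline;
    let fake_page = "<html><title>Microsoft | LinkedIn</title></html>";
    for item in spider.parse(fake_page) {
        pipeline.process(item)?;
    }
    Ok(())
}
```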
The built-in HTTP client includes:
- Automatic retries with exponential backoff
- Rate limiting detection and handling
- Configurable timeouts
- Connection pooling for better performance
- User-agent rotation support
- Comprehensive error handling
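The retry behaviour described above can be pictured with a simplified, synchronous sketch; the real client presumably wraps an async HTTP library, and retry_with_backoff is an illustrative helper rather than the crate's API:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retries `op` up to `max_retries` times, doubling the delay after each failure.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                // Exponential backoff: wait base_delay * 2^attempt before retrying.
                sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Simulated request that fails twice (e.g. rate limited) before succeeding.
    let result = retry_with_backoff(3, Duration::from_millis(100), || {
        calls += 1;
        if calls < 3 { Err("HTTP 429") } else { Ok("HTTP 200") }
    });
    println!("result after {calls} attempts: {result:?}");
}
```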
To run in development mode with debug logging:
RUST_LOG=debug cargo run -- jobs
- The scraper respects rate limits by default (1 concurrent request)
- Increase concurrency carefully to avoid being blocked
- Use appropriate timeout and retry settings for your use case
- Consider implementing delays between requests for production use
The scraper includes several mechanisms to handle rate limiting:
- Exponential backoff on retries
- 429 status code detection with automatic retry
- Configurable delays between requests
- Connection pooling to reduce overhead
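The configurable delay between requests amounts to a minimum-interval throttle, sketched below; the Throttle type and its API are illustrative, not part of the crate:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Ensures at least `min_interval` elapses between consecutive requests.
struct Throttle {
    min_interval: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last: None }
    }

    /// Call before each request; blocks if the previous one was too recent.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    let mut throttle = Throttle::new(Duration::from_millis(500));
    for i in 0..3 {
        throttle.wait();
        println!("request {i} sent at {:?}", Instant::now());
    }
}
```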
This project is for educational purposes only. Please respect LinkedIn's Terms of Service and robots.txt when using this scraper.