
LinkedIn Scraper - Rust Implementation

A Rust implementation of a LinkedIn scraper that can extract company profiles, job listings, and people profiles.

Features

  • Company Profile Spider: Scrapes company information including name, summary, industry, size, and founding date
  • Jobs Spider: Scrapes job listings with pagination support
  • People Profile Spider: Scrapes people profiles including experience and education
  • Concurrent Processing: Configurable concurrent request handling
  • HTTP Client: Built-in retry mechanisms and rate-limit handling
  • JSON Output: Saves data in JSON format with timestamps
  • Configurable Timeouts: Customizable request timeouts and retry settings

Installation

  1. Make sure you have Rust installed (https://rustup.rs/)
  2. Clone this repository
  3. Build the project:
cargo build --release

Usage

The scraper provides three main commands:

Company Profile Scraper

# Scrape specific company profiles
cargo run -- company-profile --urls "https://www.linkedin.com/company/microsoft" --urls "https://www.linkedin.com/company/google"

Jobs Scraper

# Scrape job listings
cargo run -- jobs --keywords "rust developer" --location "San Francisco"

People Profile Scraper

# Scrape people profiles
cargo run -- people-profile --profiles "danielefalchetti"

Command Line Options

Global Options

  • -c, --concurrent <N>: Number of concurrent requests (default: 1)
  • -o, --output <PATH>: Output directory for JSON files (default: "data")
  • --timeout <SECONDS>: Request timeout in seconds (default: 30)
  • --retries <N>: Maximum number of retries for failed requests (default: 3)
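
As a rough illustration, the global options above could be declared with clap's derive API. The use of clap and the struct and field names below are assumptions made for this sketch; only the flags and defaults come from the list above.

use clap::Parser;

// Hypothetical declaration of the global options; names are illustrative.
#[derive(Parser, Debug)]
struct GlobalOpts {
    /// Number of concurrent requests
    #[arg(short = 'c', long = "concurrent", default_value_t = 1)]
    concurrent: usize,

    /// Output directory for JSON files
    #[arg(short = 'o', long = "output", default_value = "data")]
    output: String,

    /// Request timeout in seconds
    #[arg(long, default_value_t = 30)]
    timeout: u64,

    /// Maximum number of retries for failed requests
    #[arg(long, default_value_t = 3)]
    retries: u32,
}

fn main() {
    let opts = GlobalOpts::parse();
    println!("{:?}", opts);
}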

Jobs Command Options

  • --keywords <KEYWORDS>: Search keywords
  • --location <LOCATION>: Job location

Company Profile Command Options

  • --urls <URL>: Company profile URLs (can be specified multiple times)

People Profile Command Options

  • --profiles <PROFILE>: LinkedIn profile usernames (can be specified multiple times)

Environment Variables

You can set configuration via environment variables:

  • CONCURRENT_REQUESTS: Number of concurrent requests
  • REQUEST_TIMEOUT: Request timeout in seconds
  • MAX_RETRIES: Maximum number of retries for failed requests
  • RETRY_DELAY_MS: Delay between retries in milliseconds
  • USER_AGENT: Custom user agent string
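
As a sketch of how these variables might be consumed, the snippet below reads each one with std::env and falls back to the documented CLI defaults where they exist. The Config struct, the helper function, and the 1000 ms retry-delay fallback are assumptions, not the crate's actual code.

use std::env;
use std::time::Duration;

struct Config {
    concurrent_requests: usize,
    request_timeout: Duration,
    max_retries: u32,
    retry_delay: Duration,
    user_agent: Option<String>,
}

impl Config {
    fn from_env() -> Self {
        // Parse a numeric variable, falling back to a default when unset or invalid.
        fn parse_or<T: std::str::FromStr>(key: &str, default: T) -> T {
            env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
        }

        Config {
            concurrent_requests: parse_or("CONCURRENT_REQUESTS", 1),
            request_timeout: Duration::from_secs(parse_or("REQUEST_TIMEOUT", 30)),
            max_retries: parse_or("MAX_RETRIES", 3),
            // 1000 ms is an assumed fallback; the crate's real default may differ.
            retry_delay: Duration::from_millis(parse_or("RETRY_DELAY_MS", 1000)),
            user_agent: env::var("USER_AGENT").ok(),
        }
    }
}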

Architecture

The scraper follows a modular architecture:

  • Spiders: Define scraping logic for each data type
  • HTTP Client: Handles requests with retry mechanisms and rate limiting
  • Pipeline: Processes and saves scraped items
  • Middleware: Extensible request/response processing
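
The split above might translate into traits roughly like the following; the trait and method names are illustrative assumptions rather than the crate's actual API.

use serde_json::Value;

// A spider encapsulates the scraping logic for one data type
// (company profiles, jobs, or people profiles).
trait Spider {
    // Seed URLs the spider starts from.
    fn start_urls(&self) -> Vec<String>;
    // Parse a fetched page into zero or more scraped items.
    fn parse(&self, url: &str, body: &str) -> Vec<Value>;
}

// A pipeline processes and persists the items a spider produces,
// e.g. by writing timestamped JSON files to the output directory.
trait Pipeline {
    fn process(&mut self, item: Value);
}

// Middleware hooks allow extensible request/response processing.
trait Middleware {
    fn on_request(&self, url: &str);
    fn on_response(&self, url: &str, status: u16);
}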

HTTP Client Features

The built-in HTTP client includes:

  • Automatic retries with exponential backoff
  • Rate limiting detection and handling
  • Configurable timeouts
  • Connection pooling for better performance
  • User-agent rotation support
  • Comprehensive error handling
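
For a rough idea of how such a client could be assembled, here is a minimal sketch using reqwest; reqwest itself, the function name, and the specific pool settings are assumptions about the implementation.

use std::time::Duration;

// Hypothetical constructor for the shared HTTP client.
fn build_client(timeout_secs: u64, user_agent: &str) -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Configurable request timeout.
        .timeout(Duration::from_secs(timeout_secs))
        // Connection pooling: keep idle connections alive for reuse.
        .pool_idle_timeout(Duration::from_secs(90))
        .pool_max_idle_per_host(4)
        // User agent (rotation would swap this value per client or per request).
        .user_agent(user_agent)
        .build()
}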

Development

To run in development mode with debug logging:

RUST_LOG=debug cargo run -- jobs

Performance Considerations

  • The scraper respects rate limits by default (1 concurrent request)
  • Increase concurrency carefully to avoid being blocked
  • Use appropriate timeout and retry settings for your use case
  • Consider implementing delays between requests for production use
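
One common way to keep concurrency bounded is a semaphore. The sketch below assumes tokio and reqwest and uses illustrative names; it is not the crate's actual implementation.

use std::sync::Arc;
use tokio::sync::Semaphore;

// Fetch all URLs with at most `concurrent` requests in flight at once.
async fn fetch_all(client: reqwest::Client, urls: Vec<String>, concurrent: usize) {
    let semaphore = Arc::new(Semaphore::new(concurrent));
    let mut handles = Vec::new();
    for url in urls {
        // Wait for a free slot before spawning the next request.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when the task finishes
            let _response = client.get(&url).send().await;
            // ... parse and hand the result to a pipeline ...
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}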

Rate Limiting and Best Practices

The scraper includes several mechanisms to handle rate limiting:

  1. Exponential backoff on retries
  2. 429 status code detection with automatic retry
  3. Configurable delays between requests
  4. Connection pooling to reduce overhead
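
A minimal sketch of points 1 and 2, assuming reqwest and tokio; the function name, signature, and backoff constants are illustrative, not the crate's actual API.

use std::time::Duration;

// Retry a GET with exponential backoff, treating HTTP 429 as retryable.
async fn get_with_retries(
    client: &reqwest::Client,
    url: &str,
    max_retries: u32,
    base_delay: Duration,
) -> Result<reqwest::Response, reqwest::Error> {
    let mut attempt = 0;
    loop {
        match client.get(url).send().await {
            // 429 Too Many Requests: back off and retry (up to max_retries).
            Ok(resp) if resp.status() == reqwest::StatusCode::TOO_MANY_REQUESTS
                && attempt < max_retries => {}
            // Any other response (success or non-retryable status) is returned.
            Ok(resp) => return Ok(resp),
            // Network-level errors are also retried up to the limit.
            Err(_) if attempt < max_retries => {}
            Err(err) => return Err(err),
        }
        // Exponential backoff: base_delay * 2^attempt. A configurable delay
        // between requests (point 3) could be layered on the same mechanism.
        tokio::time::sleep(base_delay * 2u32.pow(attempt)).await;
        attempt += 1;
    }
}

Connection pooling (point 4) is a property of the shared client itself, as in the builder sketch under HTTP Client Features above.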

License

This project is for educational purposes only. Please respect LinkedIn's Terms of Service and robots.txt when using this scraper.
