WIP!
I am currently renting around 100 web proxies for use in a side project that involves scraping a website ("Site A"). However, there is no reason why I couldn't use these proxies in other side projects that scrape other sites ("Site B", "Site C", etc.). Also, if I scrape "Site A" from several different side projects, or from different processes of the same side project, I need to be careful not to hit any rate limits.
This project is my attempt at solving these problems.
First, write a TOML file listing the available proxies, the websites to scrape, and their rate limits. An example file, config.example.toml, is provided; it lists 3 proxies and sets up Facebook with a limit of one request per 5 seconds, and Google with no rate limit.
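To give a feel for the format, here is a hypothetical sketch of such a config (the key names and layout here are assumptions on my part; the real schema is whatever config.example.toml uses):

```toml
# Hypothetical sketch -- see config.example.toml for the actual schema.
proxies = [
    "localhost:1234",
    "localhost:1235",
    "localhost:1236",
]

[sites.facebook]
# At most one request per proxy every 5 seconds.
rate_limit_secs = 5

[sites.google]
# No rate limit entry: requests go through unthrottled.
```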
Hitting http://localhost:3030/v1/facebook will return a random proxy address. It will always ensure that no proxy makes more requests than the rate limit allows. If necessary, it will wait for the correct amount of time to pass before returning a proxy address. Of course, hits to Google don't affect Facebook, so we keep track of their rate limits separately.
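For example, with the server running on port 3030 (the response shown below is a plausible shape, not necessarily the exact format):

```sh
# Request a proxy for the next Facebook scrape; this blocks until
# some proxy is free under the one-request-per-5-seconds limit.
curl http://localhost:3030/v1/facebook
# -> localhost:1235        (actual response format may differ)
```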
It is also possible to POST to http://localhost:3030/v1/facebook/ with the JSON body { "proxy": "localhost:1234" }, which will blacklist that proxy/site combination for 15 minutes.
This is normally done if you get blocked by the site and need to stop making requests for a while.
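A minimal curl sketch of that call (the URL and body come straight from the description above; the Content-Type header is an assumption):

```sh
# Tell the server that localhost:1234 got blocked by Facebook;
# it will not be handed out for Facebook again for 15 minutes.
curl -X POST http://localhost:3030/v1/facebook/ \
     -H 'Content-Type: application/json' \
     -d '{ "proxy": "localhost:1234" }'
```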