# Textmarkov
An experiment about Markov Chains and their use to counter excessive content scraping.
This repository contains Python scripts to build a Markov chain from any text
input and generate random text from it. The training scripts (learn.py and
learn_files.py) build statistics about how often a word follows a given
sequence of other words and store those counts in an SQLite database. The
generation scripts (generate.py and tarpit.py) then start with an empty
sequence and repeatedly select a random follow-up word from the database based
on the learned probabilities.
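
The core idea can be illustrated with a minimal in-memory sketch (the real
scripts persist the counts in SQLite instead, and the function names here are
purely illustrative):

```python
import random
from collections import defaultdict

HISTSIZE = 2  # length of the word sequence used as the prefix

def learn(words, chain):
    """Count how often each word follows a HISTSIZE-word prefix."""
    prefix = ("",) * HISTSIZE  # start from an empty sequence
    for word in words:
        chain[prefix][word] = chain[prefix].get(word, 0) + 1
        prefix = prefix[1:] + (word,)

def generate(chain, length=50):
    """Walk the chain, choosing follow-up words weighted by their counts."""
    prefix = ("",) * HISTSIZE
    out = []
    for _ in range(length):
        candidates = chain.get(prefix)
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        out.append(word)
        prefix = prefix[1:] + (word,)
    return " ".join(out)

chain = defaultdict(dict)
learn("the quick brown fox jumps over the lazy dog".split(), chain)
print(generate(chain))
```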
## Building the Markov Chain
To build the Markov chain, you need a source of text content, preferably one
without copyright issues (the generator will reproduce parts of it). I assume
you have a directory source/ with text files that you want the Markov chain to
learn from.
First, initialize the database. Since a lot of small writes will be made to the file, it is probably a good idea to create it on a ramdisk and move it to persistent storage later.
```sh
sqlite3 /tmp/markov.sqlite3 < dbschema.sql
```
Now, configure the “history length” of the Markov chain by adjusting HISTSIZE
in config.py. Larger values produce more consistent sentences, but will
probably reproduce large contiguous parts of your source if the source set is
small. With smaller values, the output will have more variation, but also be
very nonsensical. Try the values 2 and 3 as a starting point. Note that large
values (above 4) will also increase the database size and processing time a
lot.
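
For reference, a minimal config.py might look like the following. HISTSIZE is
the only setting discussed here; any further options in the real file are not
shown:

```python
# config.py (minimal sketch; the real file may contain further settings)

# Number of preceding words used as the prefix ("history") of the Markov
# chain. 2 or 3 is a good starting point; values above 4 increase database
# size and processing time considerably.
HISTSIZE = 3
```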
After configuration, you are ready to start the learning process.
learn_files.py takes a list of files to analyze. Use it like this:
```sh
./learn_files.py /tmp/markov.sqlite3 source/*
```
If your source directory contains nested directories, you can also use:
```sh
find source/ -type f -print0 | xargs -0 ./learn_files.py /tmp/markov.sqlite3
```
At this stage, you can run learn_files.py as many times as you want and all
results will be accumulated in the database.
Before you can generate text, a post-processing step is necessary. This step updates the statistics of follow-up words for each prefix across the entire database and must be run after every change to the database. Run the following:
```sh
./learn_files.py /tmp/markov.sqlite3 --postprocess
```
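
What --postprocess computes exactly is determined by the scripts and
dbschema.sql, which are not reproduced here; conceptually, it aggregates the
raw per-prefix counts into the statistics the generator later samples from. A
rough sketch of that idea, with made-up table and column names:

```python
import sqlite3

db = sqlite3.connect("/tmp/markov.sqlite3")

# Hypothetical schema: counts(prefix, word, n) holds the raw counts collected
# by the learning step. Recompute the total per prefix so the generator can
# turn individual counts into probabilities cheaply.
db.execute("DROP TABLE IF EXISTS prefix_totals")
db.execute("""
    CREATE TABLE prefix_totals AS
    SELECT prefix, SUM(n) AS total
    FROM counts
    GROUP BY prefix
""")
db.commit()
db.close()
```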
Finally, move the database to persistent storage.
## Generating Text
To generate a chunk of text from the database, simply run:

```sh
./generate.py markov.sqlite3
```
## Trapping Bots
This repository contains a tarpit script that can keep scraper bots busy if
they do not obey rules such as robots.txt. It generates HTML pages with
some random text and random, but stable, links to further generated pages.
This tarpit was specifically created to keep bots from scraping my Forgejo
instance, which was quite overloaded by them. That’s why the templates embed
the generated text in a `<code>` tag. Generated URLs also contain a random
“commit hash” and file name which serve as the seed of the Markov generator, so
loading the same URL always results in the same content.
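
A possible way to get that stable behaviour is to hash the request path into a
seed, along these lines (an illustration only, not the actual tarpit.py code;
the example path is made up):

```python
import hashlib
import random

def rng_for_path(path: str) -> random.Random:
    """Derive a stable RNG from the request path so that the same URL
    always yields the same generated text and the same follow-up links."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Two requests for the same (made-up) URL get identical pseudo-random output.
rng = rng_for_path("/rnd/markov/src/commit/0a1b2c3d/readme.txt")
```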
### Basic Setup
In principle, tarpit.py alone is sufficient: it creates a webserver that
generates an infinite, but stable, maze of Markov-generated text. However, it
can only serve one connection at a time, and that is not much fun with bots
that want to establish thousands of connections at once. The first step is
therefore to multithread the server using gunicorn. Run the following in the
directory of tarpit.py:
```sh
gunicorn --threads 4 -t 1800 --max-requests 10000 tarpit:app
```
### Scaling It Up (Reverse Proxy and Rate Limiting)
Unfortunately, the setup with gunicorn can use quite some CPU power because
there is no rate limiting. A good way to change that is to set up an Nginx
reverse proxy in front of gunicorn. You can use the following snippet in the
server section of your Nginx configuration:
```nginx
server {
    # generate some new random content and URLs to keep them busy
    location /rnd/markov {
        proxy_pass http://127.0.0.1:8000;
        proxy_buffering on;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        limit_rate 5;          # bytes per second
        limit_rate_after 512;  # bytes
    }
}
```
Now each connection is served at 5 bytes per second, with an initial burst of 512 bytes to push through the HTTP headers quickly. That keeps them busy for quite a while: even a modest page of, say, 10 kB takes over half an hour to deliver at that rate.
Another limit to overcome is the number of connections Nginx will handle. If necessary,
increase the number of worker processes and connections per worker in nginx.conf:
```nginx
worker_processes 4;

events {
    worker_connections 4096;
    use epoll;
}
```
### Static Content
Now, the method above works for some time, but at some point you will have so many connections that the CPU becomes the bottleneck again. One method I found to be very effective is to serve static content for a majority of the generated URLs.
The static content can be generated using the scripts in the staticfiles/
subdirectory. Run gen.sh first. It uses generate.py to create 256 static
text files with random content. Then run mkhtml.sh, which takes the text files
and embeds their content into HTML files with the same format as the pages
generated by tarpit.py. Put those HTML files in some directory on your
server. I'm assuming /var/www/htdocs/aispam here.
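
As an illustration of what the two shell scripts do, roughly the same result
could be produced with a few lines of Python. This assumes that generate.py
prints its text to standard output and that the output files follow the
static_XX.html naming used by the Nginx rule below; the real mkhtml.sh
additionally reproduces the full tarpit page template, including the links to
further generated pages.

```python
import html
import subprocess

# Create one static page per two-digit hex prefix (00..ff), 256 in total.
for i in range(256):
    # Assumption: ./generate.py writes a chunk of generated text to stdout.
    text = subprocess.run(
        ["./generate.py", "markov.sqlite3"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Minimal page wrapper; the real mkhtml.sh mirrors the tarpit template,
    # which embeds the text in a <code> tag and adds links to further pages.
    page = f"<html><body><code>{html.escape(text)}</code></body></html>\n"

    with open(f"/var/www/htdocs/aispam/static_{i:02x}.html", "w") as f:
        f.write(page)
```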
Then add the following snippet to your Nginx server configuration before the
location /rnd/markov we created earlier:
```nginx
server {
    # slowly serve static markov content if the commit hash starts with 0..e
    location ~ ^/rnd/markov/src/commit/([0-9a-e][0-9a-f]).*$ {
        default_type text/html;
        alias /var/www/htdocs/aispam/static_$1.html;
        limit_rate 5;          # bytes per second
        limit_rate_after 512;  # bytes
    }

    location /rnd/markov {
        ...
    }
}
```
Now all requests with a “commit hash” starting with 0-9 or a-e are served
from static files. That is 15 out of 16 requests on average that never reach
tarpit.py! The remaining 6.25% are still served by tarpit.py to ensure
there are enough new URLs to keep the bots busy.
This setup can scale to more than 10000 connections on a cheap hosted virtual machine.
### Keeping nice bots out of the tarpit
Use robots.txt to tell the nice bots that they should not go into the tarpit
directory:
```
User-agent: *
Disallow: /rnd/markov
```
You can leave that part out, of course, but your server will probably be blocked by all web crawlers, including regular search engines.
### Force the Bots into the Tarpit
To lure bots into the tarpit, you just have to place some links to it somewhere. However, if you want to force some bots into it based on their user agent, you can do so, of course.
First, create a new file /etc/nginx/useragent.rules with the following format:
```nginx
map $http_user_agent $badagent {
    default               0;
    ~BadBot/([0-3]\.)     1;
    ~BadBot2/([4-6]\.)    1;
}
```
It is a list of match rules on the user agent that determine whether the agent
is good ($badagent = 0) or bad ($badagent = 1). The first column defines
the match string or regex and the second the output (0 or 1). The default
value is 0, so all user agents that do not match any entry in the list are
considered good. Match strings starting with ~ are regular expressions.
Now, include the file in nginx.conf:
```nginx
http {
    # ...
    include /etc/nginx/useragent.rules;
}
```
Finally add the following block in your server configuration:
```nginx
server {
    location / {
        # force unwanted user-agents into the tarpit
        if ($badagent) {
            rewrite ^/[A-Za-z0-9_-]+/[A-Za-z0-9_-]+/(.*)$ /rnd/markov/$1 redirect;
            #return 402; # Payment required
        }

        # otherwise proxy to the normal application (Forgejo)
        proxy_pass http://[::1]:3001;
        proxy_set_header Host $host;
    }
}
```
## License
Copyright (C) 2024 Thomas Kolb
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.