# Textmarkov
An experiment about Markov Chains and their use to counter excessive content scraping.
This repository contains Python scripts to build a Markov chain from any text
input and generate random text from it. The training scripts (learn.py and
learn_files.py) build statistics about how often a word follows a given
sequence of other words and store those counts in an SQLite database. The
generation scripts (generate.py and tarpit.py) then start with an empty
sequence and repeatedly select a random follow-up word from the database based
on the learned probabilities.
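
The core idea can be illustrated with a minimal in-memory sketch (the real
scripts persist the counts in SQLite instead, and the function names here are
purely illustrative):

```python
import random
from collections import defaultdict

HISTSIZE = 2  # length of the word sequence used as the prefix

def learn(words, chain):
    """Count how often each word follows a HISTSIZE-word prefix."""
    prefix = ("",) * HISTSIZE  # start from an empty sequence
    for word in words:
        chain[prefix][word] = chain[prefix].get(word, 0) + 1
        prefix = prefix[1:] + (word,)

def generate(chain, length=50):
    """Walk the chain, choosing follow-up words weighted by their counts."""
    prefix = ("",) * HISTSIZE
    out = []
    for _ in range(length):
        candidates = chain.get(prefix)
        if not candidates:
            break
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts)[0]
        out.append(word)
        prefix = prefix[1:] + (word,)
    return " ".join(out)

chain = defaultdict(dict)
learn("the quick brown fox jumps over the lazy dog".split(), chain)
print(generate(chain))
```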
## Building the Markov Chain
To build the Markov chain, you need a source of text content, preferably one
without copyright issues (the generator will reproduce parts of it). I assume
you have a directory source/ with text files that you want the Markov chain to
learn from.
First, initialize the database. Since a lot of small writes will be made to the file, it is probably a good idea to create it on a ramdisk and move it to persistent storage later.
```sh
sqlite3 /tmp/markov.sqlite3 < dbschema.sql
```
Now, configure the “history length” of the Markov chain by adjusting HISTSIZE
in config.py. Larger values produce more consistent sentences, but will
probably reproduce large contiguous parts of your source if the source set is
small. With smaller values, the output will have more variation, but also be
very nonsensical. Try the values 2 and 3 as a starting point. Note that large
values (above 4) will also increase the database size and processing time a
lot.
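
For reference, a minimal config.py might look like the following. HISTSIZE is
the only setting discussed here; any further options in the real file are not
shown:

```python
# config.py (minimal sketch; the real file may contain further settings)

# Number of preceding words used as the prefix ("history") of the Markov
# chain. 2 or 3 is a good starting point; values above 4 increase database
# size and processing time considerably.
HISTSIZE = 3
```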
After configuration, you are ready to start the learning process.
learn_files.py takes a list of files to analyze. Use it like this:
```sh
./learn_files.py /tmp/markov.sqlite3 source/*
```
If your source directory contains nested directories, you can also use:
```sh
find source/ -type f -print0 | xargs -0 ./learn_files.py /tmp/markov.sqlite3
```
At this stage, you can run learn_files.py as many times as you want and all
results will be accumulated in the database.
Before you can generate text, a post-processing step is necessary. This step updates the statistics of follow-up words for each prefix across the entire database and must be run after every change to the database. Run the following:
```sh
./learn_files.py /tmp/markov.sqlite3 --postprocess
```
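
What --postprocess computes exactly is determined by the scripts and
dbschema.sql, which are not reproduced here; conceptually, it aggregates the
raw per-prefix counts into the statistics the generator later samples from. A
rough sketch of that idea, with made-up table and column names:

```python
import sqlite3

db = sqlite3.connect("/tmp/markov.sqlite3")

# Hypothetical schema: counts(prefix, word, n) holds the raw counts collected
# by the learning step. Recompute the total per prefix so the generator can
# turn individual counts into probabilities cheaply.
db.execute("DROP TABLE IF EXISTS prefix_totals")
db.execute("""
    CREATE TABLE prefix_totals AS
    SELECT prefix, SUM(n) AS total
    FROM counts
    GROUP BY prefix
""")
db.commit()
db.close()
```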
Finally, move the database to persistent storage.
## Generating Text
To generate a chunk of text from the database, simply run:

```sh
./generate.py markov.sqlite3
```
## Trapping Bots
This repository contains a tarpit script that can keep scraper bots busy if
they do not obey rules such as robots.txt. It generates HTML pages with
some random text and random, but stable, links to further generated pages.
This tarpit was specifically created to keep bots from scraping my Forgejo
instance, which was quite overloaded by them. That’s why the templates embed
the generated text in a `<code>` tag. Generated URLs also contain a random
“commit hash” and file name which serve as the seed of the Markov generator, so
loading the same URL always results in the same content.
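
A possible way to get that stable behaviour is to hash the request path into a
seed, along these lines (an illustration only, not the actual tarpit.py code;
the example path is made up):

```python
import hashlib
import random

def rng_for_path(path: str) -> random.Random:
    """Derive a stable RNG from the request path so that the same URL
    always yields the same generated text and the same follow-up links."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Two requests for the same (made-up) URL get identical pseudo-random output.
rng = rng_for_path("/rnd/markov/src/commit/0a1b2c3d/readme.txt")
```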
### Basic Setup
In principle, tarpit.py alone is sufficient: it creates a webserver that
generates an infinite, but stable, maze of Markov-generated text. However, it
can only serve one connection at a time, and that is not much fun with bots
that want to establish thousands of connections at once. The first step is
therefore to multithread the server using gunicorn. Run the following in the
directory of tarpit.py:
```sh
gunicorn --threads 4 -t 1800 --max-requests 10000 tarpit:app
```
### Scaling It Up (Reverse Proxy and Rate Limiting)
Unfortunately, the setup with gunicorn can use quite some CPU power because
there is no rate limiting. A good way to change that is to set up an Nginx
reverse proxy in front of gunicorn. You can use the following snippet in the
server section of your Nginx configuration:
```nginx
server {
    # generate some new random content and URLs to keep them busy
    location /rnd/markov {
        proxy_pass http://127.0.0.1:8000;
        proxy_buffering on;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        limit_rate 5;          # bytes per second
        limit_rate_after 512;  # bytes
    }
}
```
Now each connection is served at 5 bytes per second, with an initial burst of 512 bytes to push through the HTTP headers quickly. That keeps them busy for quite a while: even a modest page of, say, 10 kB takes over half an hour to deliver at that rate.
Another limit to overcome is the number of connections Nginx will handle. If necessary,
increase the number of worker processes and connections per worker in nginx.conf:
```nginx
worker_processes 4;

events {
    worker_connections 4096;
    use epoll;
}
```
### Static Content
Now, the method above works for some time, but at some point you will have so many connections that the CPU becomes the bottleneck again. One method I found to be very effective is to serve static content for a majority of the generated URLs.
The static content can be generated using the scripts in the staticfiles/
subdirectory. Run gen.sh first. It uses generate.py to create 256 static
text files with random content. Then run mkhtml.sh, which takes the text files
and embeds their content into HTML files with the same format as the pages
generated by tarpit.py. Put those HTML files in some directory on your
server. I'm assuming /var/www/htdocs/aispam here.
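
As an illustration of what the two shell scripts do, roughly the same result
could be produced with a few lines of Python. This assumes that generate.py
prints its text to standard output and that the output files follow the
static_XX.html naming used by the Nginx rule below; the real mkhtml.sh
additionally reproduces the full tarpit page template, including the links to
further generated pages.

```python
import html
import subprocess

# Create one static page per two-digit hex prefix (00..ff), 256 in total.
for i in range(256):
    # Assumption: ./generate.py writes a chunk of generated text to stdout.
    text = subprocess.run(
        ["./generate.py", "markov.sqlite3"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Minimal page wrapper; the real mkhtml.sh mirrors the tarpit template,
    # which embeds the text in a <code> tag and adds links to further pages.
    page = f"<html><body><code>{html.escape(text)}</code></body></html>\n"

    with open(f"/var/www/htdocs/aispam/static_{i:02x}.html", "w") as f:
        f.write(page)
```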
Then add the following snippet to your Nginx server configuration before the
location /rnd/markov we created earlier:
```nginx
server {
    # slowly serve static markov content if the commit hash starts with 0..e
    location ~ ^/rnd/markov/src/commit/([0-9a-e][0-9a-f]).*$ {
        default_type text/html;
        alias /var/www/htdocs/aispam/static_$1.html;
        limit_rate 5;          # bytes per second
        limit_rate_after 512;  # bytes
    }

    location /rnd/markov {
        ...
    }
}
```
Now all requests with a “commit hash” starting with 0-9 or a-e are served
from static files. That is 15 out of 16 requests on average that never reach
tarpit.py! The remaining 6.25% are still served by tarpit.py to ensure
there are enough new URLs to keep the bots busy.
This setup can scale to more than 10000 connections on a cheap hosted virtual machine.
### Keeping nice bots out of the tarpit
Use robots.txt to tell the nice bots that they should not go into the tarpit
directory:
```
User-agent: *
Disallow: /rnd/markov
```
You can leave that part out, of course, but your server will probably be blocked by all web crawlers, including regular search engines.
### Force the Bots into the Tarpit
To lure bots into the tarpit, you just have to place some links to it somewhere. However, if you want to force some bots into it based on their user agent, you can do so, of course.
First, create a new file /etc/nginx/useragent.rules with the following format:
```nginx
map $http_user_agent $badagent {
    default               0;
    ~BadBot/([0-3]\.)     1;
    ~BadBot2/([4-6]\.)    1;
}
```
It is a list of match rules on the user agent that determine whether the agent
is good ($badagent = 0) or bad ($badagent = 1). The first column defines
the match string or regex and the second the output (0 or 1). The default
value is 0, so all user agents that do not match any entry in the list are
considered good. Match strings starting with ~ are regular expressions.
Now, include the file in nginx.conf:
```nginx
http {
    # ...
    include /etc/nginx/useragent.rules;
}
```
Finally add the following block in your server configuration:
```nginx
server {
    location / {
        # force unwanted user-agents into the tarpit
        if ($badagent) {
            rewrite ^/[A-Za-z0-9_-]+/[A-Za-z0-9_-]+/(.*)$ /rnd/markov/$1 redirect;
            #return 402; # Payment required
        }

        # otherwise proxy to the normal application (Forgejo)
        proxy_pass http://[::1]:3001;
        proxy_set_header Host $host;
    }
}
```
## License
Copyright (C) 2024 Thomas Kolb
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.