Strategies for Cleaning Spam Users from OJS3 Databases

Strategies for Cleaning Spam Users from OJS3 Databases

In this post, we’ll explore effective strategies for identifying and removing spam users from Open Journal Systems (OJS) version 3 databases. This is particularly relevant for older OJS instances that were created before the implementation of CAPTCHA and other bot protection mechanisms.

The Challenge

Older OJS installations often accumulated numerous spam user accounts due to the initial lack of bot protection mechanisms. While newer versions include CAPTCHA and other security features, many existing databases still contain these legacy spam accounts. Direct deletion of users from the database isn’t possible due to referential integrity constraints, so we need a systematic approach to clean up these accounts.

Understanding the Database Structure

The relevant fields in the OJS users table that we’ll focus on are:

  • username
  • email
  • url
  • date_registered
  • date_last_login

Strategy 1: Email Analysis

Domain-Based Analysis

First, analyze email domains to identify patterns associated with spam accounts:

SELECT substring(email FROM '@(.*?)$') AS domain, COUNT(*) AS email_count
FROM users
GROUP BY domain
ORDER BY email_count DESC;

Pattern-Based Detection

Look for suspicious email patterns using these criteria:

  1. Multiple consecutive special characters:
email LIKE '%.%.%.%.%@%' OR  -- 5 or more dots
email LIKE '%_%_%_%_%@%' OR  -- 5 or more underscores
email ~* '[.]{3,}' OR        -- 3+ consecutive dots
email ~* '[_]{3,}'           -- 3+ consecutive underscores
  1. Density of special characters:
length(regexp_replace(email, '[^.]', '', 'g')) * 100.0 / length(email) > 30 OR  -- >30% dots
length(regexp_replace(email, '[^_]', '', 'g')) * 100.0 / length(email) > 30     -- >30% underscores
  1. Username characteristics:
length(substring(email from '^([^@]+)@')) > 30  -- Unusually long usernames
substring(email from '^([^@]+)@') ~ '^[0-9]+$'  -- Numeric-only usernames

Strategy 2: URL Analysis

Many spam accounts include suspicious URLs. Common patterns include:

  • Online pharmacies
  • Gambling sites
  • Social media manipulation services
  • Cryptocurrency schemes
  • Adult content

Strategy 3: Behavioral Analysis

Spam accounts often show distinctive usage patterns:

WHERE date_last_login - date_registered = INTERVAL '0 second'
AND date_registered < CAST('2024-01-01 00:01:01' AS TIMESTAMP)

This identifies accounts that:

  • Only logged in during registration
  • Were created before a specific date

Putting It All Together

Here’s a comprehensive query combining multiple strategies:

SELECT username
FROM public.users
WHERE 
-- Exclude trusted domains
email NOT LIKE '%academic%'
AND email NOT LIKE '%.edu'
AND email NOT LIKE '%@ac.edu'
AND email NOT LIKE '%@%.org'
AND email NOT LIKE '%@ac.uk'
AND url NOT LIKE '%orcid.org%'

-- Single login check
AND date_registered < CAST('2024-01-01 00:01:01' AS TIMESTAMP)
AND date_last_login - date_registered <= INTERVAL '0 second'

-- URL presence check
AND url != ''

-- Suspicious domains and patterns
AND (
    email LIKE '%@123mail.org'
    OR email LIKE '%@126.com'
    OR email LIKE '%@163.com'
    -- Add more patterns as needed
)

Copy the result as a list of usernames in a ONE COLUMN txt file: users_to_merge.txt

Removing Identified Spam Users

Once you’ve identified spam accounts, you’ll need to use OJS’s mergeUsers.php tool to properly remove them. Here’s a Python script to automate this process:

import subprocess
import sys
import argparse
from datetime import datetime

def merge_users(file_input="users_to_merge.txt", dry_run=False):
    start_time = datetime.now()
    print(f"\nStarting merge process at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
    
    try:
        with open(file_input, 'r') as f:
            users = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        print(f"File {file_input} not found.")
        return
    
    if not users:
        print(f"File {file_input} is empty. Skipping.")
        return

    target = users[0]
    other_users = users[1:]

    if dry_run:
        print(f"DRY RUN: Would merge the following users into {target}:")
        for user in other_users:
            print(f"  - {user}")
        return

    for user in other_users:
        print(f"\nAttempting to merge user '{user}' into '{target}'...")
        try:
            result = subprocess.run(
                ['php', 'tools/mergeUsers.php', target, user],
                check=True,
                capture_output=True,
                text=True
            )
            print(f"Merge output: {result.stdout if result.stdout else 'No output'}")
            print(f"Successfully merged user '{user}' into '{target}'")
        except subprocess.CalledProcessError as e:
            print(f"Error merging user {user} into {target}. Continuing...")
            print(f"Error details: {e.stderr}")
            continue

    end_time = datetime.now()
    duration = end_time - start_time
    print(f"\nProcess finished at: {end_time.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Total execution time: {duration}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Merge users from a text file.')
    parser.add_argument('--file', '-f', default="users_to_merge.txt",
                      help='Input file containing usernames')
    parser.add_argument('--dry-run', '-d', action='store_true',
                      help='Show what would be done without performing merges')
    
    args = parser.parse_args()
    merge_users(args.file, args.dry_run)

Important Considerations

  1. False Positives: Always review the results carefully before deletion. Some legitimate users might match spam patterns.

  2. Whitelist Important Domains: Maintain a whitelist of trusted domains to prevent accidental removal of legitimate users.

  3. Backup First: Always backup your database before performing any cleanup operations.

  4. Iterative Approach: Start with the most obvious spam patterns and gradually refine your criteria based on results.

Results

Using these strategies, we successfully identified and removed approximately 16,000 spam accounts from a database containing 24,000 records, significantly improving the quality of our user database while maintaining all legitimate accounts.

This cleanup not only improved database performance but also enhanced the security and integrity of our OJS installation.

3 Likes

Thank you, @Alfredo_Cosco !
We also have this topic with other suggestions that might interest those who want to clean up the user base.

3 Likes

Hi, Thanks for the tips. I haven’t gotten this far into using automation.
I’ve experienced how annoying it is to deal with dozens of thousands of spam users on my several OJS installations, and mergeUsers.php really saved the day :muscle:

2 Likes

This topic was automatically closed after 10 days. New replies are no longer allowed.