How do I combat SPAM?

jmacgreg · April 16, 2018, 11:42am

There are a number of configuration steps you can take to combat SPAM and other forms of malicious registration activity on your site.

CAPTCHA/ReCAPTCHA

Configure a CAPTCHA and enable it for user registration, notifications and commenting. We strongly advise using Google’s ReCAPTCHA. (Older OJS 2 versions include a PHP captcha, but that has proven to be less secure than ReCAPTCHA.)

If you are using OJS older than 2.4.8-3, you will not be able to use the most recent version of ReCAPTCHA (v2), and your ReCAPTCHA will not work properly after March 31 2018. Upgrading is strongly recommended.

To configure ReCAPTCHA:

Register an account and create a “property” for your website here: reCAPTCHA
Ensure that the relevant lines for enabling captcha are uncommented in config.inc.php
Copy and paste the public and private keys you receive as part of the property registration into the [captcha] section of config.inc.php:

recaptcha_public_key = 123456abcdef
recaptcha_private_key = abcdef123456

Don’t forget to test this setup by registering a test account and confirming that a) the ReCAPTCHA appears and b) the ReCAPTCHA properly validates.
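Before testing in the browser, it can save time to confirm that both keys are actually uncommented. The following is a minimal Python sketch, assuming config.inc.php uses `[section]` headers, `key = value` lines and `;` comments as in the fragments above. The helper name `read_section` and the sample text are hypothetical.

```python
def read_section(text, section):
    """Return the uncommented key/value pairs of one [section] as a dict."""
    values = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or commented-out setting
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            continue
        if current == section and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


# Sample text for illustration only; read your real config.inc.php instead.
sample = """
[captcha]
recaptcha_public_key = 123456abcdef
;recaptcha_private_key = abcdef123456
"""

settings = read_section(sample, "captcha")
missing = [key for key in ("recaptcha_public_key", "recaptcha_private_key")
           if key not in settings]
print("missing or commented out:", missing)
```

In the sample, the private key is still commented out, so it is reported as missing.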

Enable Account Validation

OJS can be configured so that an email account validation step must be completed for all new user accounts before they can log in and interact with the system. To do this, uncomment and configure the following lines in config.inc.php:

; If enabled, email addresses must be validated before login is possible.
require_validation = On

; Maximum number of days before an unvalidated account expires and is deleted
validation_timeout = 14

The above configuration will require all new registrations to click on a link and validate their account before being able to log in; and will auto-prune any non-validated accounts after 14 days.

Cleaning lots of users

If you have been the target of a SPAM bot, enabling the above procedures may not be enough: you may already have a fair number of SPAM accounts in your system. The only way to “delete” users is to merge the problem account into another account using the Merge Users option. (This effectively deletes the problem user. Any submissions, editorial history, etc. from the problem user are merged into the other user account.)

This tool can be used via the UI, but it’s slow (and only OJS/OCS 2 currently have an option to merge more than one user at a time). A more effective method is to use the command-line tool:

$ php tools/mergeUsers.php username1 username2

… where username1 is the user that will be merged into, and username2 is the user to be deleted. As it is, this tool only works on one merge at a time, but it can be scripted. An example php script would be:

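<?php

// One spam username per line; drop line endings and empty lines.
$names = file('/tmp/names.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($names as $member) {
        $member = trim($member);
        if ($member === '') {
                continue; // whitespace-only line
        }
        // escapeshellarg() passes the username as one quoted argument.
        echo exec("php /ojswebroot/tools/mergeUsers.php admin-user " . escapeshellarg($member)), "\n";
}
?>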

The script expects all spam accounts to be identified by username and listed in a names.txt file, one name per line, like so:

spamuserOne
spamUserTwo
spamUserThree
…

The names.txt file has to be stored somewhere on the server and the location referenced by the script (eg. “/tmp/names.txt”). The script should also specify the location of the mergeUsers.php script (eg. “/ojswebroot/tools/mergeUsers.php”), and also the user into which all of these accounts should be merged (eg. “admin-user” - this must be an existing account). Update those parameters to suit your environment. And also: don’t store this script, or the names.txt file, in a web-accessible location!
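Before running a bulk merge, it can help to print the commands first and read them over. The following is a minimal Python sketch of such a dry run, not part of OJS. `build_merge_commands` is a hypothetical helper, and the paths and the `admin-user` target are the same placeholders used above.

```python
import shlex


def build_merge_commands(names, target, tool="/ojswebroot/tools/mergeUsers.php"):
    """Return one mergeUsers.php argument list per spam username."""
    commands = []
    seen = set()
    for raw in names:
        name = raw.strip()
        # Skip blanks, duplicates, and the account everything is merged into.
        if not name or name == target or name in seen:
            continue
        seen.add(name)
        commands.append(["php", tool, target, name])
    return commands


# In practice the names would come from names.txt; this list is illustrative.
names = ["spamuserOne\n", "\n", "spamUserTwo\n", "admin-user\n", "spamuserOne\n"]
for command in build_merge_commands(names, "admin-user"):
    # shlex.join quotes each argument so the printed line is safe to paste.
    print(shlex.join(command))
```

Once the printed commands look right, each argument list can be passed unchanged to `subprocess.run`.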

ambs · April 16, 2018, 4:32pm

The problem here is to get the list of spammers.
How easy would it be to delete an account unless:

It is an editor
It is a reviewer
It has submitted a paper

I think I would be happy enough with this, as the journal I manage has always been open access, so nobody registered just to read it. So, users who have registered are likely those who did any of the above tasks.

Ideas? Thanks!

jmacgreg · April 18, 2018, 3:26pm

Hi @ambs, and apologies for not responding sooner! Here’s how we @ PKP go about getting lists. It requires access to MySQL though.

First, we look at the user accounts and determine if there’s some common identifying characteristic for the spam accounts. Quite often this is easy: there’s a bot out there that as part of the registration process puts “123456” in the phone and sometimes fax fields. So we get all usernames that match this:

SELECT * FROM users WHERE phone = '123456';

This will return all users that have “123456” as the phone number, in a list. Review the list for possible false positives, and copy the problem usernames to names.txt.

You can do the same for other suspicious-looking data. For example, if you see multiple users registering with emails that contain the same weird suffix, such as “eamale.com” or “yandex.com”, you can do the following:

SELECT * FROM users WHERE email LIKE "%eamale%"

We’ve also seen a situation with another spambot where they fill in the phone field with an 11-digit string, like “84286848777”. You can select for this as well:

SELECT * FROM users WHERE phone REGEXP "^[0-9]{11}$"

You can also select all users registered in the system who don’t have roles assigned to them, which is close to what you are asking:

SELECT * FROM users WHERE user_id NOT IN (SELECT user_id FROM roles)

This will return all user information for users that don’t have any role, including author, reader, reviewer, etc.

One caution about this process: you will want to look through the results carefully to make sure that you aren’t flagging any sort of false positives; otherwise, you could be deleting a user that is actually not a spammer.
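To make that manual review easier, the checks above can be combined so that each candidate is listed with the reason it was flagged. The following is a rough Python sketch over rows already exported from the users table. The function name `spam_reasons` and the sample rows are made up for illustration, and the email check compares the whole domain rather than using a LIKE pattern.

```python
import re

ELEVEN_DIGITS = re.compile(r"[0-9]{11}")


def spam_reasons(user, suspicious_domains=("eamale.com",)):
    """Return the list of heuristics that a user row (a dict) matches."""
    reasons = []
    phone = (user.get("phone") or "").strip()
    email = (user.get("email") or "").lower()
    if phone == "123456":
        reasons.append("phone is 123456")
    if ELEVEN_DIGITS.fullmatch(phone):
        reasons.append("phone is an 11-digit string")
    if email.rpartition("@")[2] in suspicious_domains:
        reasons.append("suspicious email domain")
    return reasons


# Illustrative rows only, not real data.
users = [
    {"username": "a1", "phone": "123456", "email": "x@eamale.com"},
    {"username": "b2", "phone": "84286848777", "email": "y@example.org"},
    {"username": "c3", "phone": "", "email": "editor@example.org"},
]
for user in users:
    reasons = spam_reasons(user)
    if reasons:
        print(user["username"], "->", "; ".join(reasons))
```

Keeping the reasons next to each username makes it quicker to spot a false positive before the name goes into names.txt.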

Cheers,
James

ambs · April 18, 2018, 5:24pm

Hi. Thank you. The last select will probably be the one I will go with.
As far as I can tell, deleting a user will not mess with the article metadata, so it should be harmless if I delete a user by mistake.

ambs · April 19, 2018, 5:29pm

Curiously, most users have roles. Of 6K users, just 100 do not have a role :-/

ambs · April 19, 2018, 5:37pm

Using phone like “%12345%” removed 70% of the spammers. Nice.

jmacgreg · April 19, 2018, 10:37pm

Re: the user role question: registrants may be auto-enrolling as authors or readers. I may be able to provide an SQL query that gets these. Any idea whether people are mostly registering as authors, or readers, or both?

James

ambs · April 20, 2018, 8:38pm

Both, I think.

+---------+----------+
| role_id | COUNT(1) |
+---------+----------+
|       1 |        1 |
|      16 |        2 |
|     256 |        3 |
|     512 |        3 |
|     768 |        1 |
|    4096 |       43 |
|   65536 |     6360 |
| 1048576 |     6818 |
+---------+----------+

ambs · April 22, 2018, 1:10pm

Another good heuristic:

delete from users where last_name LIKE CONCAT(first_name, "%");

asmecher · April 23, 2018, 1:48pm

Hi @ambs,

Watch out for common names like Sven Svensson.
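A small Python sketch of why that heuristic needs a manual check. It mirrors the `LIKE CONCAT(first_name, "%")` test on plain strings, assuming a case-insensitive comparison. The function name and the sample names other than Sven Svensson are made up.

```python
def last_name_starts_with_first(first_name, last_name):
    """Case-insensitive check that last_name begins with first_name."""
    # An empty first name matches every last name, just as '%' would in SQL.
    return last_name.lower().startswith(first_name.lower())


samples = [
    ("Qzxv", "QzxvQK"),     # bot-like: last name is the first name plus noise
    ("Sven", "Svensson"),   # a real name that the heuristic also matches
    ("Maria", "Lindqvist"),
]
for first, last in samples:
    flagged = last_name_starts_with_first(first, last)
    print(f"{first} {last}: {'flagged' if flagged else 'ok'}")
```

The first two rows are both flagged, so the matches are better treated as a list to review than as input to a direct delete.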

Regards,
Alec Smecher
Public Knowledge Project Team

ajnyga · July 30, 2018, 7:39am

Great instructions here, thanks! I just removed 6000 spam accounts.

Maybe OJS could have an inbuilt honeypot field? I mean just add a text field hidden with CSS to the registration form and a corresponding field to the users table. Then you could just check for users that have some value in that field and be fairly sure it is a bot.
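As a rough illustration of the idea (not how OJS or any plugin implements it), the server-side check amounts to testing whether a field that humans never see came back filled in. The field name `website_url` and the function name `honeypot_tripped` are hypothetical.

```python
def honeypot_tripped(form_data, honeypot_field="website_url"):
    """True when the CSS-hidden honeypot field was submitted with content."""
    value = form_data.get(honeypot_field) or ""
    return bool(value.strip())


# Illustrative submissions: a person leaves the hidden field empty,
# while a bot that fills in every input does not.
human = {"username": "jsmith", "email": "j@example.org", "website_url": ""}
bot = {"username": "x9", "email": "x@example.org", "website_url": "http://spam.example"}
print(honeypot_tripped(human), honeypot_tripped(bot))
```

The result could either be stored with the account for later clean-up or used to reject the registration outright.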

Eirik_Hanssen · September 26, 2018, 7:26am

I would like to support ajnyga’s suggestion of having an inbuilt honeypot field. This would very easily and quickly let us locate most of the spam accounts.

kaitlin · September 26, 2018, 12:39pm

Another query that can be useful is one to find the email address domains used the most, which can help identify suspicious domains that have the most spam users.

SELECT substring_index(email, '@', -1) domain, COUNT(*) email_count
FROM users
GROUP BY substring_index(email, '@', -1)

-- If you want to sort as well:
ORDER BY email_count DESC, domain;
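If you only have an exported list of email addresses rather than database access, the same count can be approximated outside MySQL. This is a short Python sketch. `domain_counts` is a made-up helper name and the addresses are illustrative.

```python
from collections import Counter


def domain_counts(emails):
    """Count addresses per domain, most common first, ties by domain name."""
    counts = Counter(
        email.rpartition("@")[2].strip().lower()
        for email in emails
        if "@" in email
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


emails = ["a@eamale.com", "b@eamale.com", "c@example.org", "d@EAMALE.com"]
for domain, count in domain_counts(emails):
    print(f"{count:5d}  {domain}")
```

Domains are lowercased before counting so that differently capitalised copies of the same domain are grouped together.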

ctgraham · October 15, 2018, 12:51pm

We created a formHoneypot plugin which tags an existing field or adds a new field as a honeypot. We used the honeypot to directly deny registration rather than to flag it for removal later. I’d be interested in hearing thoughts from @ajnyga and @Eirik_Hanssen regarding why we might want to allow the registration (rather than deny it directly) as we prepare to port this from 2.x to 3.x.

Also tagging @AlexWreschnig.

ajnyga · October 15, 2018, 12:59pm

I was mainly thinking whether bots will learn not to fill the honeypot field. If the removal is done a bit later, they probably will not catch on, but if the registration fails immediately, they might? This is purely speculative.

(but very nice plugin!)

AlexWreschnig · October 15, 2018, 1:21pm

Bots typically don’t learn… doesn’t mean they won’t start in the future, but it’s been a very safe approach so far?

marc · February 2, 2019, 12:13am

Yesterday a fellow asked me about how to deal with fake spam users… and I applied the 3 suggested methods:

Replace the captcha font with a better typeface (Punktype, from dafont.com)
Set “require_validation = On” and “validation_timeout = 20” in config.inc.php
And install @ctgraham’s formHoneypot plugin (really smart solution, thanks a lot!)

No spam users in the last 24 hours.
No need to put my users to work for Google for free (as you do with “reCaptcha”).

Thanks!

Mauricio_Adriano · March 13, 2019, 5:40pm

Cool Kaitlin!
I’m adding the same SQL but referring to Postgres:

SELECT substring(email from '@.*') as domain, COUNT(*) email_count
FROM users
GROUP BY 1
order by 2 desc

Mauricio_Adriano · March 13, 2019, 5:52pm

About HoneyPot:
It worked: there was a substantial reduction in the number of new daily users. However, some spambot registrations still get through because those bots don’t fill in the honeypot field. A suggestion is to add a second optional field as another honeypot field…

About enable account validation:
When account validation is enabled, it uses the values of the users table fields “date_registered”, “date_validated”, “date_last_login” and “disabled”, am I right? Could an SQL query be built on these fields to remove users?

tabber · June 3, 2020, 8:29am

[OJS3.1] We have 1000s of fake accounts which were created on a certain date and have never logged in to a journal. I am not an expert. Can I use:

SELECT * FROM users WHERE CAST(date_registered AS DATE) = CAST(date_last_login as date)
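The same comparison can be tried on exported rows before anything is removed. This is a small Python sketch, assuming `date_registered` and `date_last_login` are available as datetimes. `never_returned` is a made-up helper name and the rows are illustrative. The condition also matches a genuine user who registered once and never came back, so it is a filter for review rather than proof of spam.

```python
from datetime import datetime


def never_returned(date_registered, date_last_login):
    """True when the last login is on the same calendar day as registration."""
    if date_registered is None or date_last_login is None:
        # A NULL never compares equal in SQL, so mirror that here.
        return False
    return date_registered.date() == date_last_login.date()


# (username, date_registered, date_last_login)
rows = [
    ("bot1", datetime(2020, 5, 1, 3, 15), datetime(2020, 5, 1, 3, 16)),
    ("author", datetime(2019, 2, 10, 9, 0), datetime(2020, 6, 1, 14, 30)),
]
for username, registered, last_login in rows:
    if never_returned(registered, last_login):
        print(username)
```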