Collaborative list of spam user patterns

Whether you’ve inherited a new magazine or it’s time to put your house in order, in the forum you can find several good threads on how to deal with spam users but I miss an unified and updated list of spam patterns like the one @jmacgreg started in How do I combat spam?, so, here you have a forum-wiki-page here with an initial compilation that could be collectively extended/enriched.

I tried to make it compact and with same structure to facilitate the reading and are numbered for an easy reference. I also add a “False positives” field (with an estimation like “high/middle/low” probability) to describe in what situations you can get real users instead of spammers.

If you found a new pattern and you like to share it, when editing, please distinguish between queries that return spam users (with eventually a very few false positives) from the ones that help to identify them (list of users to dig further).

DISCLAIMER: Please, use this list with care because is not bullet prof, and you can get false positives. I’m so sorry I didn’t credit the original authors of those snippets. Kudos for them. You can find their original posts in the forum searching the published queries.

TIPS FOR NEWBIES
To facilitate the exportation to mergeUsers.php script, queries return “username”.
Use phpmyadmin or adminer (or simply redirect the output of the script to a file) to get a list.
Replace “username” with “count()" to get the number of affected users or with "” to get all data from one user.

Some links you would like to read if you are concerned about spam in OJS:

3 Likes

Return spammers username:

01. Phone/Fax number starting by “12345%”

SELECT username FROM users 
WHERE phone LIKE "12345%";

False positives: Low.
This is usually safe, but sometimes real users add a fake phone/fax.

02. Users promoting webs in their bio:

SELECT username FROM users 
WHERE url LIKE "%porn%"
OR url LIKE "%loans%"
OR url LIKE "%games%"
OR url LIKE "%amoxicillin%"
OR url LIKE "%cheap%"
OR url LIKE "%buy%"

False positives: Middle.
Too much combinations, too short or too generic filters will return a lot of false positives.

03. Registered users with NO role

SELECT username FROM users 
WHERE user_id NOT IN (SELECT user_id FROM roles)

False positives: Low.
Would return false positives with the users that used to have a role in the journal but it was revoked.

04. Users that registered and login in less than 3 seconds

SELECT * FROM users 
WHERE user_id NOT IN (SELECT user_id FROM roles) 
AND (UNIX_TIMESTAMP(date_last_login) - UNIX_TIMESTAMP(date_registered) < 3) 
AND must_change_password = 0

False positives: ??
This is supposed to be safe because only robots can go so fast, but I’m getting a lot of false positives.

05. Same name, middle name and surname

SELECT username
FROM users 
WHERE last_name LIKE first_name
AND middle_name LIKE first_name

A variant, a little bit more risky would be:

SELECT username 
FROM users 
WHERE last_name LIKE CONCAT(first_name, "%") 
AND middle_name LIKE CONCAT(first_name, "%") 

False positives: Low.
You can also try without middle_name, but triple check is safer. The variant can match real users like “Steve Stevenson”

06. Phones with wrong length (11 digits)

SELECT * FROM users 
WHERE phone REGEXP "^[0-9]{11}$"

False positives: High
It’s quite common to se real users adding a fake phone/fax or with typing errors.

07. Mailing address with “@”

SELECT username FROM users 
WHERE `mailing_address` LIKE '%@%'

False positives: High
This one could fail a lot. As robots, some humans also are confused and add their mail to this field.

08. BBcode [url= signature

SELECT u.username FROM DATABASE.users u 
INNER JOIN DATABASE.user_settings us ON u.user_id=us.user_id 
WHERE us.setting_name="signature" 
AND us.setting_value LIKE "%[url=%";

To identify spam

A. Users from well-known spam domains:

SELECT username FROM users 
WHERE email LIKE "%eamale%"

B. List of most used domain names in mail.

SELECT substring_index(email, '@', -1) domain, COUNT(*) email_count
FROM users
GROUP BY substring_index(email, '@', -1)
ORDER BY email_count DESC, domain;

C. To describe user activity in order to identify suspect users (for ojs 2.x):

SELECT
    u.user_id,
    u.username,
    u.email,
    u.first_name,
    u.middle_name,
    u.last_name,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated,
    bit_or(r.role_id) roles,
    count(distinct a.article_id) articles,
    count(distinct e.article_id) editing,
    count(distinct v.review_id) reviews,
    group_concat(b.setting_value ORDER BY b.setting_value separator ' ') biography,
    group_concat(c.body ORDER BY c.body separator ' ') comments,
    group_concat(s.setting_value ORDER BY s.setting_value separator ' ') signature,
    count(distinct c.comment_id) comment_count
FROM
    users u
    left outer join user_settings b on (u.user_id = b.user_id and b.setting_name = 'biography')
    left outer join user_settings s on (u.user_id = s.user_id and s.setting_name = 'signature')
    left outer join comments c on (u.user_id = c.user_id)
    left outer join roles r on (u.user_id = r.user_id)
    left outer join articles a on (u.user_id = a.user_id and a.status in (1, 3))
    left outer join review_assignments v on (u.user_id = v.reviewer_id)
    left outer join edit_assignments e on (u.user_id = e.editor_id)
GROUP BY
    u.user_id,
    u.username,
    u.email,
    u.first_name,
    u.middle_name,
    u.last_name,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated;

As far as the data model changed, for 3.x the query could be something like:

SELECT 
    u.user_id,
    u.username,
    group_concat( DISTINCT(grp.user_group_id) ORDER BY grp.user_group_id separator '|') userGroupIDs,
    max(giv.setting_value) givenName,
    max(fam.setting_value) familyName,
    max(orc.setting_value) orcid,
    max(aff.setting_value) affiliation,
    u.email,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated,
    max(bio.setting_value) as biography,
    max(com.body) comments,
    max(sig.setting_value) signature,
    count(distinct rev.review_id) reviews_count,
    count(distinct com.comment_id) comment_count
FROM
    users u
    left outer join user_settings giv on (u.user_id = giv.user_id and giv.setting_name = 'givenName')
    left outer join user_settings fam on (u.user_id = fam.user_id and fam.setting_name = 'familyName')
    left outer join user_settings orc on (u.user_id = orc.user_id and orc.setting_name = 'orcid')
    left outer join user_settings aff on (u.user_id = aff.user_id and aff.setting_name = 'affiliation')
    left outer join user_settings bio on (u.user_id = bio.user_id and bio.setting_name = 'biography')
    left outer join user_settings sig on (u.user_id = sig.user_id and sig.setting_name = 'signature')
    left outer join comments com on (u.user_id = com.user_id)
    left outer join review_assignments rev on (u.user_id = rev.reviewer_id)
    left outer join user_user_groups grp on (u.user_id = grp.user_id)
GROUP BY
    u.user_id,
    u.username,
    u.email,
    u.url,
    u.disabled,
    u.date_registered
LIMIT 10;

Warning: This query mixes a lot of data and form big DBs it query can take a loooong time.

D. Users with an URL in their bio

SELECT username FROM users 
WHERE url LIKE IS NOT NUL

Warning: It’s a real users like to add their real urls, but the fact is usually spammers create fake users to promote their webs so this is a good way to find fake users.

2 Likes

A quite usual petition is making a DB extraction of the platform users including roles.
To do this you can use Clinton’s query (see “C. To describe user activity in order to identify suspect users (for ojs 2.x)” but you will get a number instead of a role name.

This number is the decimal equivalence for the different rolenames (coded in binary) and in 2.x branches would be something like:

Role Name role_id
Admin 1
Journal Managers 16
Editors 256
Section Editors 512
Layout Editors 768
Reviewers 4096
Copyeditors 8196
Authors 65536
Readers 1048576
Subscription Managers 2097152
IMPORTANT: Those equivalences change between versions so use this information with caution.

For 3.3.x branches, to get ALL the constants, you can use the “constant.php” function included in tools called as follows:

$ php lib/pkp/tools/constants.php

Or if you want to know what constant matches a particular value, you can “grep” or run:

$ php lib/pkp/tools/constants.php 256

(Thanks to @NateWr for the tool and the tip)

3 Likes

This is an old thread but very nice work from Marc.

The search of users by role applies to OJS2. For OJS3 this would be

SELECT username FROM users 
WHERE user_id NOT IN (SELECT user_id FROM user_user_groups) 

You could also consider stage_assignments and review_assignments:

SELECT username FROM users 
WHERE user_id NOT IN (SELECT user_id FROM user_user_groups) 
AND user_id NOT IN (SELECT user_id FROM stage_assignments) 
AND user_id NOT IN (SELECT reviewer_id FROM review_assignments) 
AND user_id NOT IN (SELECT editor_id FROM edit_decisions)
2 Likes

Thanks Antii-Jussi. :wink:

Former list is a wiki, so feel free to edit to add the changes.

Take care,
m.

This sentence can be useful

echo 'SELECT u.username FROM DATABASE.users u 
INNER JOIN DATABASE.user_settings us 
ON u.user_id=us.user_id 
WHERE us.setting_name="signature" 
AND us.setting_value LIKE "%[url=%";' | mysql > bots.txt

Or this:

echo 'SELECT u.username FROM DATABASE.users u 
INNER JOIN DATABASE.user_settings us 
ON u.user_id=us.user_id 
WHERE us.setting_name="givenName" 
AND us.setting_value=CONCAT(u.username, " ", u.username) \G' | mysql
2 Likes

Hi Juanito,

I have a doubt with the first one: Sometimes real users add urls in their signature and they will be banned too, isn’t it?

The “bots.txt” ending made me think that would be nice to to create an script to feed your fail2ban blacklist with the IPs found by spam patterns.

Cheers,
m.

Hola Marc.

En este caso, ellos están usando notación del tipo, en sus firmas

[url=http://spam]Your are the best[/url]

Presupuse que es una formación típica que usan en sistemas de foros o comentarios donde se permite el ‘BBcode’

1 Like

Ah, pues tienes razón… si usan sintaxis BBcode va a ser spam si o si.

Gracias por el aporte y por la aclaración.
¿Añades tus patrones a la lista o prefieres que lo haga yo?

Un saludo,
m.

Recently some journals asked to clean up their OJS 3.2 so I updated Clinton’s query to 3.x data model. This is the resultant slow&dirty query:

SELECT 
    u.user_id,
    u.username,
    group_concat( DISTINCT(grp.user_group_id) ORDER BY grp.user_group_id separator '|') userGroupIDs,
    max(giv.setting_value) givenName,
    max(fam.setting_value) familyName,
    max(orc.setting_value) orcid,
    max(aff.setting_value) affiliation,
    u.email,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated,
    max(bio.setting_value) as biography,
    max(com.body) comments,
    max(sig.setting_value) signature,
    count(distinct rev.review_id) reviews_count,
    count(distinct com.comment_id) comment_count
FROM
    users u
    left outer join user_settings giv on (u.user_id = giv.user_id and giv.setting_name = 'givenName')
    left outer join user_settings fam on (u.user_id = fam.user_id and fam.setting_name = 'familyName')
    left outer join user_settings orc on (u.user_id = orc.user_id and orc.setting_name = 'orcid')
    left outer join user_settings aff on (u.user_id = aff.user_id and aff.setting_name = 'affiliation')
    left outer join user_settings bio on (u.user_id = bio.user_id and bio.setting_name = 'biography')
    left outer join user_settings sig on (u.user_id = sig.user_id and sig.setting_name = 'signature')
    left outer join comments com on (u.user_id = com.user_id)
    left outer join review_assignments rev on (u.user_id = rev.reviewer_id)
    left outer join user_user_groups grp on (u.user_id = grp.user_id)
GROUP BY
    u.user_id,
    u.username,
    u.email,
    u.url,
    u.disabled,
    u.date_registered
LIMIT 10;

WARNING: Using this extraction to do a mailing is against the GDPR unless you:

  1. be sure this practice is explained in your privacy terms page.
  2. only mail to users that explicitly accepted your terms (see).
  3. don’t use third party services (like mailchimp). Servers need to be in the EU.

About the groups (related but not the same than roles) here you have a list of id/name equivalences:

id name
2 Journal manager
3 Journal editor
4 Production editor
5 Section editor
6 Guest editor
7 Copyeditor
8 Designer
9 Funding coordinator
10 Indexer
11 Layout Editor
12 Marketing and sales coordinator
13 Proofreader
14 Author
15 Translator
16 Reviewer
17 Reader
18 Subscription Manager

Cheers,
m.

1 Like
SELECT * FROM `users` WHERE email=mailing_address; 
1 Like