Collaborative list of spam user patterns

Return spammers username:

01. Phone/Fax number starting by “12345%”

SELECT username FROM users 
WHERE phone LIKE "12345%";

False positives: Low.
This is usually safe, but sometimes real users add a fake phone/fax.

02. Users promoting webs in their bio:

SELECT username FROM users 
WHERE url LIKE "%porn%"
OR url LIKE "%loans%"
OR url LIKE "%games%"
OR url LIKE "%amoxicillin%"
OR url LIKE "%cheap%"
OR url LIKE "%buy%"

False positives: Middle.
Too much combinations, too short or too generic filters will return a lot of false positives.

03. Registered users with NO role

SELECT username FROM users 
WHERE user_id NOT IN (SELECT user_id FROM roles)

False positives: Low.
Would return false positives with the users that used to have a role in the journal but it was revoked.

04. Users that registered and login in less than 3 seconds

SELECT * FROM users 
WHERE user_id NOT IN (SELECT user_id FROM roles) 
AND (UNIX_TIMESTAMP(date_last_login) - UNIX_TIMESTAMP(date_registered) < 3) 
AND must_change_password = 0

False positives: ??
This is supposed to be safe because only robots can go so fast, but I’m getting a lot of false positives.

05. Same name, middle name and surname

SELECT username
FROM users 
WHERE last_name LIKE first_name
AND middle_name LIKE first_name

A variant, a little bit more risky would be:

SELECT username 
FROM users 
WHERE last_name LIKE CONCAT(first_name, "%") 
AND middle_name LIKE CONCAT(first_name, "%") 

False positives: Low.
You can also try without middle_name, but triple check is safer. The variant can match real users like “Steve Stevenson”

06. Phones with wrong length (11 digits)

SELECT * FROM users 
WHERE phone REGEXP "^[0-9]{11}$"

False positives: High
It’s quite common to se real users adding a fake phone/fax or with typing errors.

07. Mailing address with “@”

SELECT username FROM users 
WHERE `mailing_address` LIKE '%@%'

False positives: High
This one could fail a lot. As robots, some humans also are confused and add their mail to this field.

08. BBcode [url= signature

SELECT u.username FROM DATABASE.users u 
INNER JOIN DATABASE.user_settings us ON u.user_id=us.user_id 
WHERE us.setting_name="signature" 
AND us.setting_value LIKE "%[url=%";

To identify spam

A. Users from well-known spam domains:

SELECT username FROM users 
WHERE email LIKE "%eamale%"

B. List of most used domain names in mail.

SELECT substring_index(email, '@', -1) domain, COUNT(*) email_count
FROM users
GROUP BY substring_index(email, '@', -1)
ORDER BY email_count DESC, domain;

C. To describe user activity in order to identify suspect users (for ojs 2.x):

SELECT
    u.user_id,
    u.username,
    u.email,
    u.first_name,
    u.middle_name,
    u.last_name,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated,
    bit_or(r.role_id) roles,
    count(distinct a.article_id) articles,
    count(distinct e.article_id) editing,
    count(distinct v.review_id) reviews,
    group_concat(b.setting_value ORDER BY b.setting_value separator ' ') biography,
    group_concat(c.body ORDER BY c.body separator ' ') comments,
    group_concat(s.setting_value ORDER BY s.setting_value separator ' ') signature,
    count(distinct c.comment_id) comment_count
FROM
    users u
    left outer join user_settings b on (u.user_id = b.user_id and b.setting_name = 'biography')
    left outer join user_settings s on (u.user_id = s.user_id and s.setting_name = 'signature')
    left outer join comments c on (u.user_id = c.user_id)
    left outer join roles r on (u.user_id = r.user_id)
    left outer join articles a on (u.user_id = a.user_id and a.status in (1, 3))
    left outer join review_assignments v on (u.user_id = v.reviewer_id)
    left outer join edit_assignments e on (u.user_id = e.editor_id)
GROUP BY
    u.user_id,
    u.username,
    u.email,
    u.first_name,
    u.middle_name,
    u.last_name,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated;

As far as the data model changed, for 3.x the query could be something like:

SELECT 
    u.user_id,
    u.username,
    group_concat( DISTINCT(grp.user_group_id) ORDER BY grp.user_group_id separator '|') userGroupIDs,
    max(giv.setting_value) givenName,
    max(fam.setting_value) familyName,
    max(orc.setting_value) orcid,
    max(aff.setting_value) affiliation,
    u.email,
    u.url,
    u.disabled,
    u.date_registered,
    u.date_validated,
    max(bio.setting_value) as biography,
    max(com.body) comments,
    max(sig.setting_value) signature,
    count(distinct rev.review_id) reviews_count,
    count(distinct com.comment_id) comment_count
FROM
    users u
    left outer join user_settings giv on (u.user_id = giv.user_id and giv.setting_name = 'givenName')
    left outer join user_settings fam on (u.user_id = fam.user_id and fam.setting_name = 'familyName')
    left outer join user_settings orc on (u.user_id = orc.user_id and orc.setting_name = 'orcid')
    left outer join user_settings aff on (u.user_id = aff.user_id and aff.setting_name = 'affiliation')
    left outer join user_settings bio on (u.user_id = bio.user_id and bio.setting_name = 'biography')
    left outer join user_settings sig on (u.user_id = sig.user_id and sig.setting_name = 'signature')
    left outer join comments com on (u.user_id = com.user_id)
    left outer join review_assignments rev on (u.user_id = rev.reviewer_id)
    left outer join user_user_groups grp on (u.user_id = grp.user_id)
GROUP BY
    u.user_id,
    u.username,
    u.email,
    u.url,
    u.disabled,
    u.date_registered
LIMIT 10;

Warning: This query mixes a lot of data and form big DBs it query can take a loooong time.

D. Users with an URL in their bio

SELECT username FROM users 
WHERE url LIKE IS NOT NUL

Warning: It’s a real users like to add their real urls, but the fact is usually spammers create fake users to promote their webs so this is a good way to find fake users.

2 Likes