Charset problems (ISO-8859-1 x UTF-8)

wilsonw · May 21, 2021, 4:55pm

Thank you for sharing your leads on this. After upgrading to 3.3 and setting the utf-8 charset according to the config template, we found that our database holds historical data in mixed charsets. New data is saved properly. While looking into how to repair the old data, I found that titles, subtitles, and abstracts in the publication_settings table (among other places) have the â€* characters stored at the database level.

SELECT * FROM `publication_settings` WHERE `setting_value` LIKE '%â€%'


Following your pointer to ftfy, I ran ftfy.fix_text() on a few publications and fixed them by updating the database manually. Since there are 600+ cases with mixed-encoding characters, I’m planning to loop through that result set and fix the text with ftfy. Since this worked manually for a few publications, I’m fairly certain automating the rest should work. Is there anything I should be careful of before proceeding, or can you confirm that this should work in theory?
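For reference, the loop I have in mind looks roughly like the sketch below. It only builds a preview of what would change and writes nothing to the database. The row shape (a key tuple of publication_id, locale, setting_name plus the setting_value) is my assumption about how I would pull the rows out, and plan_updates and fallback_fix are hypothetical helper names. fallback_fix is a standard-library stand-in that only undoes a single round of UTF-8 misread as Windows-1252; ftfy.fix_text() covers more cases and is used when it is installed.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

try:
    # Third-party library mentioned above: pip install ftfy
    from ftfy import fix_text
except ImportError:
    fix_text = None

Row = Tuple[Hashable, Optional[str]]
Change = Tuple[Hashable, str, str]


def fallback_fix(text: str) -> str:
    """Stdlib-only stand-in for ftfy (hypothetical helper).

    Reverses one round of UTF-8 bytes decoded as Windows-1252.
    Text that does not fit that pattern is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def plan_updates(rows: Iterable[Row], fix: Callable[[str], str]) -> List[Change]:
    """Return (key, old_value, new_value) only for rows the fixer changes."""
    changes = []
    for key, value in rows:
        if value is None:
            continue  # NULL settings have nothing to repair
        fixed = fix(value)
        if fixed != value:
            changes.append((key, value, fixed))
    return changes


if __name__ == "__main__":
    # Assumed row shape: ((publication_id, locale, setting_name), setting_value)
    sample = [
        ((1, "en_US", "title"), "Don\u00e2\u20ac\u2122t panic"),  # mojibake
        ((2, "en_US", "title"), "Plain title"),                   # already fine
        ((3, "en_US", "subtitle"), None),
    ]
    fixer = fix_text or fallback_fix
    for key, old, new in plan_updates(sample, fixer):
        print(key, repr(old), "->", repr(new))
```

My plan would be to take a database dump first, review the preview output, and only then run the UPDATE statements for the rows listed in it.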

Thank you!

Edit: looks like we have these mixed encodings all over the place.

Search results for "%â€%" at least one of the words:
3 matches in announcement_settings
9 matches in author_settings
30 matches in citations
9 matches in comments
10 matches in controlled_vocab_entry_settings
1143 matches in email_log
64 matches in email_templates_default_data
15 matches in email_templates_settings
41 matches in event_log_settings
3 matches in issue_settings
25 matches in journal_settings
1 match in navigation_menu_item_settings
208 matches in notes
3 matches in publication_galleys
598 matches in publication_settings
21 matches in rt_searches
2 matches in section_settings
272 matches in submission_comments
36 matches in submission_file_settings
7967 matches in submission_search_keyword_list
4 matches in user_settings
Total: 10464 matches

Wilson