Workflow for formatting extracted references

Domek · September 3, 2019, 10:56am

Hello,

I recently updated our journal from OJS 3.1.1.0 to 3.1.2.1. The editors noticed that they can no longer adjust the formatting of references on the article metadata page: there is no WYSIWYG editor. The references of old articles contain HTML tags, while new ones are stored as plain text, so the editors now have to insert HTML tags into the references manually.

My question is: what is the intended workflow for formatting references (e.g. italic font) as displayed on the article page? Can I change the behaviour of the extraction algorithm so that certain parts of a reference are styled (e.g. the name of the source in italics)? Or can I apply our CSL style to the extracted references?

Best regards,

Dominik

Domek · September 26, 2019, 2:34pm

Hi, I found a bug that occurs when a citation contains a URL as plain text (not inside <a href=…) and the whole citation is wrapped in <p></p>. The regex does not handle the closing </p> correctly and leaves its final > stranded.

Citation like:
<p>Hulík, V., Hulíková Tesárková, K. & Hraba, J. (2016). Problematika neúspěšných ukončení vysokoškolského studia (drop-outs) v českém kontextu. Available at: http://kredo.reformy-msmt.cz/download/w-13-3/KREDO_prezentace_150313_2_Hulikova-Tesarkova-Hulik-Hraba.pdf.</p>

Extracted as:

<p>Hulík, V., Hulíková Tesárková, K. & Hraba, J. (2016). Problematika neúspěšných ukončení vysokoškolského studia (drop-outs) v českém kontextu. Available at: <a href="http://kredo.reformy-msmt.cz/download/w-13-3/KREDO_prezentace_150313_2_Hulikova-Tesarkova-Hulik-Hraba.pdf.">http://kredo.reformy-msmt.cz/download/w-13-3/KREDO_prezentace_150313_2_Hulikova-Tesarkova-Hulik-Hraba.pdf.</a></p>
>

I temporarily fixed it by adding </p> and </div> as alternatives in the last capturing group of the regex in Citation.inc.php:

function getCitationWithLinks() {
	$citation = $this->getRawCitation();
	if (stripos($citation, '<a href=') === false) {
		$citation = preg_replace(
			'#((https?|ftp):\/\/(\S*?\.\S*?))(([\s)\[\]{},;"\':<>])?(\.)?(\s|$|<\/p>|<\/div>))#i',
			// Original pattern:
			//'#((https?|ftp)://(\S*?\.\S*?))(([\s)\[\]{},;"\':<>])?(\.)?(\s|$))#i',
			'<a href="$1">$1</a>$4',
			$citation
		);
	}
	return $citation;
}
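
For anyone who wants to reproduce this outside OJS, here is a minimal standalone sketch. It is not OJS code: the linkify() helper, the constant names and the shortened example.org citation are assumptions made up for illustration. Only the two patterns are taken from the function above.

```php
<?php
// Standalone sketch, not OJS code. linkify(), ORIGINAL, PATCHED and the
// sample citation are hypothetical; the patterns are copied from the post.

// Shipped pattern: the tail only accepts whitespace or end of string.
const ORIGINAL = '#((https?|ftp)://(\S*?\.\S*?))(([\s)\[\]{},;"\':<>])?(\.)?(\s|$))#i';
// Patched pattern: the tail also accepts a closing </p> or </div>.
const PATCHED = '#((https?|ftp):\/\/(\S*?\.\S*?))(([\s)\[\]{},;"\':<>])?(\.)?(\s|$|<\/p>|<\/div>))#i';

function linkify(string $pattern, string $citation): string {
    // $1 is the URL, $4 is whatever punctuation or tag followed it.
    return preg_replace($pattern, '<a href="$1">$1</a>$4', $citation);
}

$citation = '<p>See http://example.org/a.pdf.</p>';

// The lazy \S*? keeps growing until the tail can match, so it swallows ".</p"
// and only the final ">" is left outside the link.
echo linkify(ORIGINAL, $citation), "\n";
// <p>See <a href="http://example.org/a.pdf.</p">http://example.org/a.pdf.</p</a>>

// With </p> allowed in the tail, the URL stops before the trailing dot.
echo linkify(PATCHED, $citation), "\n";
// <p>See <a href="http://example.org/a.pdf">http://example.org/a.pdf</a>.</p>
```

The broken href in the first output is presumably what the browser then repairs into the markup shown under "Extracted as", with the stray > left over.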

asmecher · September 29, 2019, 8:18pm

Hi @Domek,

Can you try the following replacement regular expression?

$citation = preg_replace(
    '|((https?://)?([\d\w\.-]+\.[\w\.]{2,6})[^\s\]\[\<\>]*(?=\.)/?)|i',
    '<a href="$1">$1</a>',
    $citation
);

If you can report back on whether this works, I’ll get it merged in for the next release.

Regards,
Alec Smecher
Public Knowledge Project Team

Domek · September 30, 2019, 1:53pm

Hi @asmecher,

I’ve been playing around with it. Your regex works perfectly if the user puts “.” at the end of each citation. When it’s missing, the extraction returns this:

[Screenshot: references_link_regex]

The previous regex, extended with the most common closing tags (</p>, </li>, </div>), works even when the dot at the end is missing.

I forgot to include </li> in my previous code, so this is what I currently use in my installation:

$citation = preg_replace(
	'#((https?|ftp):\/\/(\S*?\.\S*?))(([\s)\[\]{},;"\':<>])?(\.)?(\s|$|<\/p>|<\/div>|<\/li>))#i',
	'<a href="$1">$1</a>$4',
	$citation
);

Best regards,

Dominik

asmecher · October 1, 2019, 1:49pm

Hi @Domek,

Hrm, I’m sure there’s a way to do this with a pure regular expression, but I couldn’t manage it without more unwanted behavior. Please try this in place of the preg_replace call and let me know if it resolves all issues?

return preg_replace_callback(
    '|((https?://)?([\d\w\.-]+\.[\w\.]{2,6})[^\s\]\[\<\>]*/?)|i',
    function($matches) {
        $trailingDot = substr($matches[1], -1) == '.';
        $url = rtrim($matches[1], '.');
        return "<a href=\"$url\">$url</a>" . ($trailingDot?'.':'');
    },
    $citation
);

Thanks,
Alec Smecher
Public Knowledge Project Team

Domek · October 2, 2019, 9:56am

Hi @asmecher,

The regex now also captures standalone DOI identifiers in the citation, as well as some abbreviated author names.

[Screenshot: citation_regex]

I’ve changed the quantifier for the https group from ? to + and now it seems to be working fine.

Here is the code for the whole function:

function getCitationWithLinks() {
	$citation = $this->getRawCitation();
	if (stripos($citation, '<a href=') === false) {
		return preg_replace_callback(
			'|((https?:\/\/)+([\d\w\.-]+\.[\w\.]{2,6})[^\s\]\[\<\>]*\/?)|i',
			function($matches) {
				$trailingDot = substr($matches[1], -1) == '.';
				$url = rtrim($matches[1], '.');
				return "<a href=\"$url\">$url</a>" . ($trailingDot?'.':'');
			},
			$citation
		);
	}
	return $citation;
}

Sorry, I don’t know how to put colours in the code.

Best regards,

Domek

asmecher · October 2, 2019, 11:58am

Hi @Domek,

Thanks for your help testing this; I don’t have a representative dataset on hand.

Changing the ? to a + might have unwanted side-effects; I’ve fine-tuned the regular expression to be pickier about the protocol clause of the URL (which is the effect your change had) and come up with this:

return preg_replace_callback(
    '#(http|https|ftp)://[\d\w\.-]+\.[\w\.]{2,6}[^\s\]\[\<\>]*/?#',
    function($matches) {
        $trailingDot = in_array($char = substr($matches[0], -1), array('.', ','));
        $url = rtrim($matches[0], '.,');
        return "<a href=\"$url\">$url</a>" . ($trailingDot?$char:'');
    },
    $citation
);

This will also work better when several URLs are separated by commas.
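
To make the difference concrete, here is a minimal standalone sketch comparing the earlier optional-protocol pattern with the stricter one above. It is not OJS code: the linkify() helper, the constant names and the sample strings (example.org, example.com, the 10.1000/xyz DOI) are assumptions for illustration, and for simplicity both patterns are run through the newer callback.

```php
<?php
// Standalone sketch, not OJS code. linkify(), LOOSE, STRICT and the sample
// strings are hypothetical; the patterns and callback logic are from the thread.

// Earlier pattern: the protocol is optional, so anything "dotted" can match.
const LOOSE = '|((https?://)?([\d\w\.-]+\.[\w\.]{2,6})[^\s\]\[\<\>]*/?)|i';
// Stricter pattern: an explicit http, https or ftp protocol is required.
const STRICT = '#(http|https|ftp)://[\d\w\.-]+\.[\w\.]{2,6}[^\s\]\[\<\>]*/?#';

function linkify(string $pattern, string $citation): string {
    return preg_replace_callback(
        $pattern,
        function (array $matches): string {
            // Keep one trailing "." or "," outside the generated link.
            $char = substr($matches[0], -1);
            $trailing = in_array($char, ['.', ','], true);
            $url = rtrim($matches[0], '.,');
            return "<a href=\"$url\">$url</a>" . ($trailing ? $char : '');
        },
        $citation
    );
}

// A bare DOI has no protocol: the loose pattern links it, the strict one does not.
echo linkify(LOOSE, 'DOI 10.1000/xyz'), "\n";
// DOI <a href="10.1000/xyz">10.1000/xyz</a>
echo linkify(STRICT, 'DOI 10.1000/xyz'), "\n";
// DOI 10.1000/xyz

// Comma-separated URLs: each separator stays outside its link.
echo linkify(STRICT, 'See http://example.org/a.pdf, http://example.com/b.'), "\n";
// See <a href="http://example.org/a.pdf">http://example.org/a.pdf</a>, <a href="http://example.com/b">http://example.com/b</a>.
```

One thing to keep in mind: rtrim() strips every trailing dot and comma, but only one character is appended again, so a URL followed by ".." would lose a dot.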

If you’re able to confirm, I’ll make sure this gets into the next release!

Regards,
Alec Smecher
Public Knowledge Project Team

Domek · October 2, 2019, 2:37pm

Mr. @asmecher,

No problems on my side. Everything was extracted correctly! We’re done here.

Thank you for looking into it.

Best regards,

Dominik

asmecher · October 2, 2019, 3:05pm

Hi @Domek,

Thanks! I’ve created an issue and patch for this:

Regards,
Alec Smecher
Public Knowledge Project Team