Decoding numeric HTML entities via PHP
Question
I have this code to decode numeric HTML entities to their UTF-8 equivalent characters.
I'm trying to convert this character reference:
&#146;
Which should output:
’
However, it just disappears (no output). (I've checked the page source, and the page has the correct UTF-8 charset headers/meta tags.)
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    $string = html_entity_decode($string, $quote_style, $charset);
    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\1")', $string);
    //this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);
    return $string;
}

function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}

function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}
echo '=' . entity_decode('&#146;');
Recommended answer
html_entity_decode already does what you need:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It returns the character with these UTF-8 bytes:
binary hex: c292
Which is PRIVATE USE TWO (U+0092), a C1 control character that typically has no visible glyph. Depending on your PHP configuration/version/build, it might not be returned at all.
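To see what your own PHP build does with this reference, one option is to dump the result as hex instead of echoing it, since a control character is invisible in a browser. A minimal sketch: the helper name decoded_hex is made up for illustration, and the outcome is deliberately not assumed here, because it can depend on the PHP version and on the doctype flag passed in.

```php
<?php
// Hypothetical helper (not part of PHP): decode a string and return the
// resulting bytes as hex, so an invisible character or an untouched
// entity becomes visible.
function decoded_hex(string $input, int $flags = ENT_COMPAT): string
{
    return bin2hex(html_entity_decode($input, $flags, 'UTF-8'));
}

// "c292" means the reference was decoded to U+0092.
// "26233134363b" is the hex of the literal text "&#146;", i.e. left undecoded.
echo decoded_hex('&#146;'), "\n";
echo decoded_hex('&#146;', ENT_COMPAT | ENT_XHTML), "\n";
```

Comparing the two lines shows whether the doctype flag changes how your build treats the reference.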
There are also some quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
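If the goal is to get the same character a browser would show (for &#146; that is the right single quotation mark, U+2019), one possible approach is to rewrite references in the 128 to 159 range to their cp1252 counterparts before calling html_entity_decode. The sketch below is an assumption about how that could be done, not something from the original answer: the names CP1252_REMAP and decode_entities_like_browser are hypothetical, and the table follows the replacement list in the HTML5 specification. The five code points that cp1252 leaves undefined (129, 141, 143, 144, 157) are passed through untouched.

```php
<?php
// Code points 128..159 mapped to the characters browsers actually display
// (the cp1252 interpretation documented by HTML5).
const CP1252_REMAP = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

// Hypothetical helper: decode entities the way a browser would.
function decode_entities_like_browser(string $html): string
{
    // Step 1: rewrite decimal (&#146;) and hex (&#x92;) references whose
    // code point is in the remap table; leave every other reference as is.
    $rewritten = preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m): string {
            $cp = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            $mapped = CP1252_REMAP[$cp] ?? null;
            return $mapped === null ? $m[0] : '&#' . $mapped . ';';
        },
        $html
    );

    // Step 2: let PHP decode everything, including the rewritten references.
    return html_entity_decode($rewritten ?? $html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo bin2hex(decode_entities_like_browser('&#146;')), "\n"; // e28099 (U+2019)
```

Doing the remap as a pre-pass keeps the actual decoding in html_entity_decode, so named entities and other numeric references are still handled by PHP itself.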
See also: "&#146; is getting converted as \u0092 by nokogiri in ruby on rails"