How to use XMLHttpRequest to download an HTML page in the background and extract a text element from it?
Problem description
I want to make a Greasemonkey script that, while you are on URL_1, parses the whole HTML page of URL_2 in the background in order to extract a text element from it.
To be specific, I want to download the whole page's HTML code (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName("critic_consensus")[0] to extract the text I want from the element with class name "critic_consensus".
I found this in MDN: HTML in XMLHttpRequest. Based on it, I ended up with the following code, which unfortunately does not work:
var xhr = new XMLHttpRequest();
xhr.onload = function() {
    // The class name must be passed as a string.
    alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
};
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/", true);
xhr.responseType = "document";
xhr.send();
When I run it in Firefox Scratchpad, it shows this error message:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://www.rottentomatoes.com/m/godfather/. This can be fixed by moving the resource to the same domain or enabling CORS.
PS. The reason I don't use the Rotten Tomatoes API is that they have removed the critics consensus from it.
Recommended answer
For cross-origin requests where the fetched site has not set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)
GM_xmlhttpRequest is expressly designed to allow cross-origin requests.
To get your target information, create a DOMParser and parse the response text with it. Do not use jQuery methods to parse the response, because they cause extraneous images, scripts, and objects to load, which slows things down or can crash the page.
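The parsing step can be sketched on its own, separate from the network request. The helper name extractTextByClass and its optional third parameter are assumptions made for illustration; they are not part of Greasemonkey or any library. The sketch assumes a browser or userscript environment where DOMParser is a global.

```javascript
// Sketch: return the text of the first element with the given class
// from an HTML string. `extractTextByClass` is a hypothetical helper name.
function extractTextByClass(html, className, parser) {
    // DOMParser is a global in browsers and userscripts; a parser can be
    // passed in explicitly for environments that lack one.
    parser = parser || new DOMParser();
    // Parsing as "text/html" builds an inert document: its scripts are not
    // run and its images are not fetched.
    var doc = parser.parseFromString(html, "text/html");
    var node = doc.getElementsByClassName(className)[0];
    // Return null instead of throwing when the class is absent,
    // for example after the target site changes its markup.
    return node ? node.textContent.trim() : null;
}
```

In a browser, calling extractTextByClass('<p class="critic_consensus"> A classic. </p>', "critic_consensus") would return "A classic.", and a class name that does not occur in the markup would return null.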
Here's a complete script that illustrates the process:
// ==UserScript==
// @name        _Parse Ajax Response for specific nodes
// @include     http://stackoverflow.com/questions/*
// @require     http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
// @grant       GM_xmlhttpRequest
// ==/UserScript==

GM_xmlhttpRequest ( {
    method:     "GET",
    url:        "http://www.rottentomatoes.com/m/godfather/",
    onload:     function (response) {
        var parser = new DOMParser ();
        /* IMPORTANT!
            1) For Chrome, see
            https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
            for a work-around.

            2) jQuery.parseHTML() and similar are bad because it causes images, etc., to be loaded.
        */
        var doc       = parser.parseFromString (response.responseText, "text/html");
        var criticTxt = doc.getElementsByClassName ("critic_consensus")[0].textContent;

        $("body").prepend ('<h1>' + criticTxt + '</h1>');
    },
    onerror:    function (e) {
        console.error ('**** error ', e);
    },
    onabort:    function (e) {
        console.error ('**** abort ', e);
    },
    ontimeout:  function (e) {
        console.error ('**** timeout ', e);
    }
} );
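If the callback style above becomes unwieldy, the same request can be wrapped in a Promise. This is only a sketch: gmGetText is a hypothetical helper name, and its second parameter exists only so that a substitute request function can be supplied outside a userscript. It assumes the script is granted GM_xmlhttpRequest as in the metadata block above.

```javascript
// Sketch: wrap a GM_xmlhttpRequest-style function in a Promise that
// resolves with the response text. `gmGetText` is a hypothetical helper name.
function gmGetText(url, request) {
    // Default to the userscript API; this needs "@grant GM_xmlhttpRequest".
    request = request || GM_xmlhttpRequest;
    return new Promise(function (resolve, reject) {
        request({
            method: "GET",
            url: url,
            onload: function (response) {
                // onload also fires for HTTP errors such as 404,
                // so the status code is checked explicitly.
                if (response.status >= 200 && response.status < 300) {
                    resolve(response.responseText);
                } else {
                    reject(new Error("HTTP " + response.status + " for " + url));
                }
            },
            onerror: function () { reject(new Error("Network error for " + url)); },
            onabort: function () { reject(new Error("Request aborted for " + url)); },
            ontimeout: function () { reject(new Error("Request timed out for " + url)); }
        });
    });
}
```

Inside a userscript, gmGetText("http://www.rottentomatoes.com/m/godfather/").then(function (html) { ... }) would hand the page's HTML to the callback, where it can be parsed with a DOMParser as in the script above.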