HTML to DOM

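This article shows how to convert HTML to DOM. Although you can now use the DOMParser or XMLHttpRequest objects for this, most browsers other than Firefox have not yet implemented these features. Until all browsers implement them, you can use the methods below in your own code to turn an HTML string into DOM.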
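
For comparison, where DOMParser is available and accepts the text/html type, the conversion is a single call. The following is a minimal sketch under that assumption; the wrapper name parseWithDOMParser is hypothetical, and it returns null when DOMParser is missing. Note that this call only parses the markup; it does not sanitize it.

```javascript
// Hypothetical wrapper: returns a Document, or null when DOMParser is unavailable.
function parseWithDOMParser(aHTMLString) {
  if (typeof DOMParser === "undefined") {
    return null; // e.g. a non-browser environment
  }
  // "text/html" asks for the HTML parser rather than the XML parser
  return new DOMParser().parseFromString(aHTMLString, "text/html");
}

var parsed = parseWithDOMParser("<p>foo</p><p>bar</p>");
if (parsed) {
  // Count the paragraphs in the parsed document
  console.log(parsed.getElementsByTagName("p").length);
}
```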

Safely parsing simple HTML to DOM

When using XMLHttpRequest to get the HTML of a remote webpage, it is often advantageous to turn that HTML string into DOM for easier manipulation. However, there are potential dangers involved in injecting remote content in a privileged context in your extension, so it can be desirable to parse the HTML safely.

The function below will safely parse simple HTML and return a DOM object which can be manipulated like web page elements. This will remove tags like <script>, <style>, <head>, <body>, <title>, and <iframe>. It will also remove all JavaScript, including element attributes that contain JavaScript.

function HTMLParser(aHTMLString){
  // The XHTML namespace URI uses the http scheme, not https
  var html = document.implementation.createDocument("http://www.w3.org/1999/xhtml", "html", null),
    body = document.createElementNS("http://www.w3.org/1999/xhtml", "body");
  html.documentElement.appendChild(body);

  // Parse the string into a fragment and attach it to the detached <body>
  body.appendChild(Components.classes["@mozilla.org/feed-unescapehtml;1"]
    .getService(Components.interfaces.nsIScriptableUnescapeHTML)
    .parseFragment(aHTMLString, false, null, body));

  return body;
}

It works by creating a content-level (this is safer than chrome-level) <body> element in a new XHTML document, then parsing the HTML fragment and attaching that fragment to the <body>. The <body> is returned, and it is never actually appended to the current page. The returned <body> object is of type Element.

The following example counts the number of paragraphs in an HTML string.

var DOMPars = HTMLParser('<p>foo</p><p>bar</p>');
alert(DOMPars.getElementsByTagName('p').length);

If HTMLParser() returns the html variable instead of body, you get the whole document object with its full set of methods, so you can retrieve the contents of a div like this:

var DOMPars = HTMLParser("<div id='userInfo'>John was a mediocre programmer, but people liked him <strong>anyway</strong>.</div>");
alert(DOMPars.getElementById('userInfo').innerHTML);

To parse a complete HTML page, load it into an iframe whose type is content (not chrome). See Using a hidden iframe element to parse HTML to a window's DOM below.

Using a hidden iframe element to parse HTML to a window's DOM

The sample code below also takes care of some additional work, such as giving the frame a name and an ID.

var frame = document.getElementById("sample-frame");
if (!frame) {
  // create frame
  frame = document.createElement("iframe"); // iframe (or browser on older Firefox)
  frame.setAttribute("id", "sample-frame");
  frame.setAttribute("name", "sample-frame");
  frame.setAttribute("type", "content");
  frame.setAttribute("collapsed", "true");
  document.getElementById("main-window").appendChild(frame);
  // or
  // document.documentElement.appendChild(frame);

  // set restrictions as needed
  frame.webNavigation.allowAuth = false;
  frame.webNavigation.allowImages = false;
  frame.webNavigation.allowJavascript = false;
  frame.webNavigation.allowMetaRedirects = true;
  frame.webNavigation.allowPlugins = false;
  frame.webNavigation.allowSubframes = false;

  // listen for load
  frame.addEventListener("load", function (event) {
    // the document of the HTML in the DOM
    var doc = event.originalTarget;
    // skip blank page or frame
    if (doc.location.href == "about:blank" || doc.defaultView.frameElement) return;

    // do something with the DOM of doc
    alert(doc.location.href);

    // when done remove frame or set location "about:blank"
    setTimeout(function () {
      var frame = document.getElementById("sample-frame");
      // remove frame
      // frame.destroy(); // if using browser element instead of iframe
      frame.parentNode.removeChild(frame);
      // or set location "about:blank"
      // frame.contentDocument.location.href = "about:blank";
    }, 10);
  }, true);
}

// load a page
frame.contentDocument.location.href = "https://www.mozilla.org/";
// or
// frame.webNavigation.loadURI("https://www.mozilla.org/", Components.interfaces.nsIWebNavigation.LOAD_FLAGS_NONE, null, null, null);
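
The sample above uses the fixed id and name sample-frame. If several hidden frames may be alive at the same time, one option is to generate the identifier instead. The following is a minimal sketch; makeFrameId and frameCounter are hypothetical names and not part of any Mozilla API.

```javascript
// Hypothetical helper: builds an identifier that is unique within this script.
var frameCounter = 0;

function makeFrameId(aPrefix) {
  frameCounter += 1;
  // The counter guarantees uniqueness for this script instance;
  // the timestamp only makes clashes with other code less likely.
  return aPrefix + "-" + Date.now().toString(36) + "-" + frameCounter;
}

var frameId = makeFrameId("sample-frame");
// The value could then replace the hard-coded string, for example:
// frame.setAttribute("id", frameId);
// frame.setAttribute("name", frameId);
```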

If what you have to begin with is a string containing HTML, you can convert that string to a data URI and then open the data URI in the iframe.
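
The conversion to a data URI can be sketched as follows. The helper name htmlToDataURI is hypothetical, and the sketch assumes the markup is UTF-8 text that should be served as text/html.

```javascript
// Hypothetical helper: wraps an HTML string in a data: URI.
function htmlToDataURI(aHTMLString) {
  // encodeURIComponent escapes characters such as "#", "%" and "&",
  // which would otherwise be misread as part of the URI syntax.
  return "data:text/html;charset=utf-8," + encodeURIComponent(aHTMLString);
}

var dataURI = htmlToDataURI("<p>foo</p><p>bar</p>");
// The URI could then be loaded into the hidden frame, for example:
// frame.contentDocument.location.href = dataURI;
```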

Using a hidden XUL iframe (alternative method)

Sometimes a browser element is overkill, does not meet your needs, or has requirements you can't fulfill. While working on Donkeyfire, I discovered the iframe XUL element, which is very easy to use.

As an example, I will show a browser overlay .xul file, and some JavaScript code to access it.

Here is some XUL code you can add to your browser overlay .xul file. Don't forget to modify the id and name!

<vbox hidden="false" height="0">
  <iframe type="content" src="" name="donkey-browser" hidden="false" id="donkey-browser" height="0"/>
</vbox>

Then, in your extension's "load" event handler:

onLoad: function() {
	donkeybrowser = document.getElementById("donkey-browser");
	if (donkeybrowser) {
		donkeybrowser.style.height = "0px";
		donkeybrowser.webNavigation.allowAuth = true;
		donkeybrowser.webNavigation.allowImages = false;
		donkeybrowser.webNavigation.allowJavascript = false;
		donkeybrowser.webNavigation.allowMetaRedirects = true;
		donkeybrowser.webNavigation.allowPlugins = false;
		donkeybrowser.webNavigation.allowSubframes = false;
		donkeybrowser.addEventListener("DOMContentLoaded", function (e) { donkeyfire.donkeybrowser_onPageLoad(e); }, true);
	}
},

With that code, we obtain a reference to the iframe element we declared in the .xul file. The most interesting piece of code here is the DOMContentLoaded event listener we define for the element. Let's take a look at the donkeyfire.donkeybrowser_onPageLoad() handler:

donkeybrowser_onPageLoad: function(aEvent) {
	var doc = aEvent.originalTarget;
	var url = doc.location.href;
	if (aEvent.originalTarget.nodeName == "#document") { // ok, it's a real page, let's do our magic
		dump("[DF] URL = "+url+"\n");
		var text = doc.evaluate("/html/body/h1",doc,null,XPathResult.STRING_TYPE,null).stringValue;
		dump("[DF] TEXT in /html/body/h1 = "+text+"\n");
	}
},

As you can see, we obtain full access to the DOM of the page we loaded in background, and we can even evaluate XPath expressions. In the example, we dump() to the console the page's URL and the text contained in the first h1 tag of the page's <body>.

But, we still need to see how to execute the famous loadURI() method using our iframe:

// The second argument is a load-flags value, not the interface itself
donkeybrowser.webNavigation.loadURI("https://developer.mozilla.org",
              Components.interfaces.nsIWebNavigation.LOAD_FLAGS_NONE, null, null, null);

I also recommend taking a look at the nsIWebNavigation interface.
