The DOMDocument class

(PHP 5, PHP 7, PHP 8)

Introduction

Represents an entire HTML or XML document; serves as the root of the document tree.

Class synopsis

class DOMDocument extends DOMNode implements DOMParentNode {

/* Inherited constants */

public const int DOMNode::DOCUMENT_POSITION_DISCONNECTED = 0x1;

public const int DOMNode::DOCUMENT_POSITION_PRECEDING = 0x2;

public const int DOMNode::DOCUMENT_POSITION_FOLLOWING = 0x4;

public const int DOMNode::DOCUMENT_POSITION_CONTAINS = 0x8;

public const int DOMNode::DOCUMENT_POSITION_CONTAINED_BY = 0x10;

public const int DOMNode::DOCUMENT_POSITION_IMPLEMENTATION_SPECIFIC = 0x20;

/* Properties */

public readonly ?DOMDocumentType $doctype;

public readonly DOMImplementation $implementation;

public readonly ?DOMElement $documentElement;

public readonly ?string $actualEncoding;

public ?string $encoding;

public readonly ?string $xmlEncoding;

public bool $standalone;

public bool $xmlStandalone;

public ?string $version;

public ?string $xmlVersion;

public bool $strictErrorChecking;

public ?string $documentURI;

public readonly mixed $config;

public bool $formatOutput;

public bool $validateOnParse;

public bool $resolveExternals;

public bool $preserveWhiteSpace;

public bool $recover;

public bool $substituteEntities;

public readonly ?DOMElement $firstElementChild;

public readonly ?DOMElement $lastElementChild;

public readonly int $childElementCount;

/* Inherited properties */

public readonly string $nodeName;

public ?string $nodeValue;

public readonly int $nodeType;

public readonly ?DOMNode $parentNode;

public readonly ?DOMElement $parentElement;

public readonly DOMNodeList $childNodes;

public readonly ?DOMNode $firstChild;

public readonly ?DOMNode $lastChild;

public readonly ?DOMNode $previousSibling;

public readonly ?DOMNode $nextSibling;

public readonly ?DOMNamedNodeMap $attributes;

public readonly bool $isConnected;

public readonly ?DOMDocument $ownerDocument;

public readonly ?string $namespaceURI;

public string $prefix;

public readonly ?string $localName;

public readonly ?string $baseURI;

public string $textContent;

/* Methods */

public function __construct(string $version = "1.0", string $encoding = "")

public function adoptNode(DOMNode $node): DOMNode|false

public function append(DOMNode|string ...$nodes): void

public function createAttribute(string $localName): DOMAttr|false

public function createAttributeNS(?string $namespace, string $qualifiedName): DOMAttr|false

public function createCDATASection(string $data): DOMCdataSection|false

public function createComment(string $data): DOMComment

public function createDocumentFragment(): DOMDocumentFragment

public function createElement(string $localName, string $value = ""): DOMElement|false

public function createElementNS(?string $namespace, string $qualifiedName, string $value = ""): DOMElement|false

public function createEntityReference(string $name): DOMEntityReference|false

public function createProcessingInstruction(string $target, string $data = ""): DOMProcessingInstruction|false

public function createTextNode(string $data): DOMText

public function getElementById(string $elementId): ?DOMElement

public function getElementsByTagName(string $qualifiedName): DOMNodeList

public function getElementsByTagNameNS(?string $namespace, string $localName): DOMNodeList

public function importNode(DOMNode $node, bool $deep = false): DOMNode|false

public function load(string $filename, int $options = 0): bool

public function loadHTML(string $source, int $options = 0): bool

public function loadHTMLFile(string $filename, int $options = 0): bool

public function loadXML(string $source, int $options = 0): bool

public function normalizeDocument(): void

public function prepend(DOMNode|string ...$nodes): void

public function registerNodeClass(string $baseClass, ?string $extendedClass): true

public function relaxNGValidate(string $filename): bool

public function relaxNGValidateSource(string $source): bool

public function replaceChildren(DOMNode|string ...$nodes): void

public function save(string $filename, int $options = 0): int|false

public function saveHTML(?DOMNode $node = null): string|false

public function saveHTMLFile(string $filename): int|false

public function saveXML(?DOMNode $node = null, int $options = 0): string|false

public function schemaValidate(string $filename, int $flags = 0): bool

public function schemaValidateSource(string $source, int $flags = 0): bool

public function validate(): bool

public function xinclude(int $options = 0): int|false

/* Inherited methods */

public function DOMNode::appendChild(DOMNode $node): DOMNode|false

public function DOMNode::C14N(
    bool $exclusive = false,
    bool $withComments = false,
    ?array $xpath = null,
    ?array $nsPrefixes = null
): string|false

public function DOMNode::C14NFile(
    string $uri,
    bool $exclusive = false,
    bool $withComments = false,
    ?array $xpath = null,
    ?array $nsPrefixes = null
): int|false

public function DOMNode::cloneNode(bool $deep = false): DOMNode|false

public function DOMNode::compareDocumentPosition(DOMNode $other): int

public function DOMNode::contains(DOMNode|DOMNameSpaceNode|null $other): bool

public function DOMNode::getLineNo(): int

public function DOMNode::getNodePath(): ?string

public function DOMNode::getRootNode(?array $options = null): DOMNode

public function DOMNode::hasAttributes(): bool

public function DOMNode::hasChildNodes(): bool

public function DOMNode::insertBefore(DOMNode $node, ?DOMNode $child = null): DOMNode|false

public function DOMNode::isDefaultNamespace(string $namespace): bool

public function DOMNode::isEqualNode(?DOMNode $otherNode): bool

public function DOMNode::isSameNode(DOMNode $otherNode): bool

public function DOMNode::isSupported(string $feature, string $version): bool

public function DOMNode::lookupNamespaceURI(?string $prefix): ?string

public function DOMNode::lookupPrefix(string $namespace): ?string

public function DOMNode::normalize(): void

public function DOMNode::removeChild(DOMNode $child): DOMNode|false

public function DOMNode::replaceChild(DOMNode $node, DOMNode $child): DOMNode|false

public function DOMNode::__sleep(): array

public function DOMNode::__wakeup(): void

}

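Properties

actualEncoding: Deprecated as of PHP 8.4.0. The actual encoding of the document. This is a read-only equivalent of encoding.
childElementCount: The number of child elements.
config: Deprecated as of PHP 8.4.0. Configuration used when DOMDocument::normalizeDocument() is invoked.
doctype: The document type declaration associated with this document.
documentElement: The DOMElement object that is the first document element. This is null if there is none.
documentURI: The location of the document, or null if undefined.
encoding: Encoding of the document, as specified by the XML declaration. This attribute is not present in the final DOM Level 3 specification, but it is the only way of manipulating the XML document encoding in this implementation.
firstElementChild: The first child element, or null if there is none.
formatOutput: Nicely formats output with indentation and extra space. This has no effect if the document was loaded with preserveWhiteSpace enabled.
implementation: The DOMImplementation object that handles this document.
lastElementChild: The last child element, or null if there is none.
preserveWhiteSpace: Do not remove redundant white space. Defaults to true. Setting it to false has the same effect as passing LIBXML_NOBLANKS as the option to DOMDocument::load().
recover: Proprietary. Enables recovery mode, i.e. attempts to parse documents that are not well-formed. This attribute is not part of the DOM specification and is specific to libxml.
resolveExternals: Set it to true to load external entities from a doctype declaration. This is useful for including character entities in your XML document.
standalone: Deprecated. Whether or not the document is standalone, as specified by the XML declaration. Corresponds to xmlStandalone.
strictErrorChecking: Throws DOMException on errors. Defaults to true.
substituteEntities: Proprietary. Whether or not to substitute entities. This attribute is not part of the DOM specification and is specific to libxml. Defaults to false.

Warning
Enabling entity substitution may facilitate XML External Entity (XXE) attacks.
validateOnParse: Loads the DTD and validates against it. Defaults to false.

Warning
Enabling DTD validation may facilitate XML External Entity (XXE) attacks.
version: Deprecated. Version of XML. Corresponds to xmlVersion.
xmlEncoding: An attribute specifying, as part of the XML declaration, the encoding of this document. This is null when unspecified or when it is not known, such as when the document was created in memory.
xmlStandalone: An attribute specifying, as part of the XML declaration, whether this document is standalone. This is false when unspecified. A standalone document is one with no external markup declarations. An example of such a markup declaration is a DTD that declares an attribute with a default value.
xmlVersion: An attribute specifying, as part of the XML declaration, the version number of this document. If there is no declaration and the document supports the "XML" feature, the value is "1.0".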
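
As a rough illustration of the XML-declaration properties listed above, the following sketch parses a small document and reads them back. The XML string and the `$info` array are illustrative assumptions, not part of the manual.

```php
<?php
// Minimal sketch: the XML string below is an illustrative assumption.
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><root><item/></root>');

// Read the values that the parser took from the XML declaration.
$info = [
    'xmlVersion'        => $doc->xmlVersion,
    'xmlEncoding'       => $doc->xmlEncoding,
    'xmlStandalone'     => $doc->xmlStandalone,
    'documentElement'   => $doc->documentElement->nodeName,
    'childElementCount' => $doc->childElementCount, // requires PHP 8.0+
];

foreach ($info as $name => $value) {
    echo $name, ': ', var_export($value, true), "\n";
}
```

For a document created in memory with `new DOMDocument()` and never loaded, xmlEncoding would be null, as described in the list above.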

Changelog

Version	Description
8.4.0	`actualEncoding` and `config` are now formally deprecated.
8.0.0	DOMDocument now implements DOMParentNode.
8.0.0	The unimplemented method DOMDocument::renameNode() has been removed.

Notes

Note: The DOM extension uses UTF-8 encoding. Use mb_convert_encoding(), UConverter::transcode(), or iconv() to handle other encodings.

Note: When json_encode() is used on a DOMDocument object, the result is that of encoding an empty object.
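
The UTF-8 note can be sketched as follows: text that arrives in another encoding is converted before it is handed to the DOM. The input bytes and element name are illustrative assumptions; iconv() is used here, but mb_convert_encoding() or UConverter::transcode() should work equally well.

```php
<?php
// "caf\xE9" is the word "café" encoded as ISO-8859-1 (illustrative input).
$latin1 = "caf\xE9";

// Convert to UTF-8 before passing the text to any DOM method.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('word');
$root->appendChild($doc->createTextNode($utf8));
$doc->appendChild($root);

$xml = $doc->saveXML();
echo $xml;
```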

See Also

» W3C specification for Document

Table of Contents

DOMDocument::adoptNode — Transfer a node from another document
DOMDocument::append — Appends nodes after the last child node
DOMDocument::__construct — Creates a new DOMDocument object
DOMDocument::createAttribute — Create new attribute
DOMDocument::createAttributeNS — Create new attribute node with an associated namespace
DOMDocument::createCDATASection — Create new cdata node
DOMDocument::createComment — Create new comment node
DOMDocument::createDocumentFragment — Create new document fragment
DOMDocument::createElement — Create new element node
DOMDocument::createElementNS — Create new element node with an associated namespace
DOMDocument::createEntityReference — Create new entity reference node
DOMDocument::createProcessingInstruction — Creates new PI node
DOMDocument::createTextNode — Create new text node
DOMDocument::getElementById — Searches for an element with a certain id
DOMDocument::getElementsByTagName — Searches for all elements with given local tag name
DOMDocument::getElementsByTagNameNS — Searches for all elements with given tag name in specified namespace
DOMDocument::importNode — Import node into current document
DOMDocument::load — Load XML from a file
DOMDocument::loadHTML — Load HTML from a string
DOMDocument::loadHTMLFile — Load HTML from a file
DOMDocument::loadXML — Load XML from a string
DOMDocument::normalizeDocument — Normalizes the document
DOMDocument::prepend — Prepends nodes before the first child node
DOMDocument::registerNodeClass — Register extended class used to create base node type
DOMDocument::relaxNGValidate — Performs relaxNG validation on the document
DOMDocument::relaxNGValidateSource — Performs relaxNG validation on the document
DOMDocument::replaceChildren — Replace children in document
DOMDocument::save — Dumps the internal XML tree back into a file
DOMDocument::saveHTML — Dumps the internal document into a string using HTML formatting
DOMDocument::saveHTMLFile — Dumps the internal document into a file using HTML formatting
DOMDocument::saveXML — Dumps the internal XML tree back into a string
DOMDocument::schemaValidate — Validates a document based on a schema. Only XML Schema 1.0 is supported.
DOMDocument::schemaValidateSource — Validates a document based on a schema
DOMDocument::validate — Validates the document based on its DTD
DOMDocument::xinclude — Substitutes XIncludes in a DOMDocument object

User Contributed Notes 19 notes

Fernando H ¶

18 years ago

Here is a quick example of how to use this class, so that new users can get started without having to figure it all out by themselves. (At the time of posting, this documentation had just been added and lacked examples.)

<?php

// Set the content type to XML so that the browser will recognise it as XML.
header( "content-type: application/xml; charset=ISO-8859-15" );

// "Create" the document.
$xml = new DOMDocument( "1.0", "ISO-8859-15" );

// Create some elements.
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track", "The ninth symphony" );

// Set the attributes.
$xml_track->setAttribute( "length", "0:01:15" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );

// Create another element, just to show that you can add any reasonable number of sublevels.
$xml_note = $xml->createElement( "Note", "The last symphony composed by Ludwig van Beethoven." );

// Append the whole bunch.
$xml_track->appendChild( $xml_note );
$xml_album->appendChild( $xml_track );

// Repeat the above with some different values..
$xml_track = $xml->createElement( "Track", "Highway Blues" );

$xml_track->setAttribute( "length", "0:01:33" );
$xml_track->setAttribute( "bitrate", "64kb/s" );
$xml_track->setAttribute( "channels", "2" );
$xml_album->appendChild( $xml_track );

$xml->appendChild( $xml_album );

// Serialize the document and print the XML.
print $xml->saveXML();

?>

Output (XML declaration omitted, indented here for readability):
<Album>
  <Track length="0:01:15" bitrate="64kb/s" channels="2">
    The ninth symphony
    <Note>
      The last symphony composed by Ludwig van Beethoven.
    </Note>
  </Track>
  <Track length="0:01:33" bitrate="64kb/s" channels="2">Highway Blues</Track>
</Album>

If you want your PHP->DOM code to run under the .xml extension, set up your web server to run the .xml extension with PHP (refer to the PHP installation/configuration documentation on how to do this).

Note that this:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = $xml->createElement( "Album" );
$xml_track = $xml->createElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

is NOT the same as this:
<?php
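// Will NOT work: an element created with "new DOMElement" is read-only until it is
// attached to a document, so the appendChild() call on $xml_album fails.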
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml_track = new DOMElement( "Track" );
$xml_album->appendChild( $xml_track );
$xml->appendChild( $xml_album );
?>

although this will work, because the new element is appended directly to the document:
<?php
$xml = new DOMDocument( "1.0", "ISO-8859-15" );
$xml_album = new DOMElement( "Album" );
$xml->appendChild( $xml_album );
?>
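
A possible way around the read-only restriction, sketched under the assumption that a standalone DOMElement only becomes writable once it belongs to a document: append it to the document first, then add children to it. The `$failed` flag is only there to show what is expected to happen in the failing case.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// A standalone DOMElement is read-only, so appending to it is expected to fail.
$failed = false;
try {
    $orphan = new DOMElement('Album');
    $orphan->appendChild(new DOMElement('Track'));
} catch (DOMException $e) {
    $failed = true;
}

// Attach the element to the document first; after that it accepts children.
$album = $doc->appendChild(new DOMElement('Album'));
$album->appendChild(new DOMElement('Track'));

// Serialize only the root element, without the XML declaration.
$xml = $doc->saveXML($doc->documentElement);
echo $xml, "\n";
```

Using `$doc->createElement()` as in the first example avoids the issue altogether, since nodes created that way already belong to the document.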

developer at nabtron dot com ¶

10 years ago

For those landing here while looking into encoding issues with UTF-8 characters: they are pretty easy to correct, without adding any extra tag to your HTML.

We'll be using mb_convert_encoding().

Thanks to the user who shared SmartDOMDocument in an earlier comment, I got the idea for solving it. However, I truly wish they had shared the method instead of just giving a link.

Anyway, coming back to the solution, you can simply use:

<?php

// Check that the content we received is not empty, to avoid an error from loadHTML()
if ( empty( $content ) ) {
    return false;
}

// Convert all non-ASCII characters to HTML entities so the parser cannot misread the encoding
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');

// Create a new document
$doc = new DOMDocument('1.0', 'utf-8');

// Collect libxml errors internally instead of emitting warnings
libxml_use_internal_errors(true);

// Load the content without adding enclosing html/body tags or a doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Do whatever you want with $doc now

?>

I hope it solves the issue for someone! If you need my help or service to fix your code, you can reach me on nabtron.com or contact me at the email mentioned with this comment.
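
Since the 'HTML-ENTITIES' conversion target is deprecated as of PHP 8.2, one possible substitute is mb_encode_numericentity(), which rewrites every non-ASCII character as a numeric entity. This is only a sketch of that idea; the sample markup and variable names are assumptions.

```php
<?php
// Illustrative UTF-8 input containing a non-ASCII character.
$content = "<div><p>caf\u{E9}</p></div>";

// Replace every character from U+0080 upward with a numeric entity.
$ascii = mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

// Collect parser messages internally and restore the previous setting afterwards.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The DOM hands the text back as UTF-8.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```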

jay at jaygilford dot com ¶

16 years ago

Here's a small function I wrote to get all page links using the DOMDocument which will hopefully be of use to others

<?php
/**
 * @author Jay Gilford
 */
 
/**
 * get_links()
 * 
 * @param string $url
 * @return array
 */
function get_links($url) {
 
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();
 
    // Load the url's contents into the DOM
    $xml->loadHTMLFile($url);
 
    // Empty array to hold all links to return
    $links = array();
 
    //Loop through each <a> tag in the dom and add it to the link array
    foreach($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }
 
    //Return the links
    return $links;
}
?>
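
The same idea can be tried without a network request by parsing an HTML string. The helper name `get_links_from_html()` and the sample markup are hypothetical; this is a sketch rather than a replacement for the function above.

```php
<?php
// Hypothetical helper: same approach as get_links(), but for an HTML string.
function get_links_from_html(string $html): array
{
    // Keep parser warnings out of the output and restore the setting afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect the href and link text of every <a> element.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $links[] = [
            'url'  => $link->getAttribute('href'),
            'text' => $link->textContent,
        ];
    }

    return $links;
}

$links = get_links_from_html('<p><a href="/a">First</a> and <a href="/b">Second</a></p>');
print_r($links);
```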

andreas at userbrain dot com ¶

4 years ago

After struggling with parsing and modifying partial HTML content for several hours, I came to this solution which does work for me and is relatively simple compared to what else I found online.

This solution fixes unwanted DOCTYPE and html, body tags as well as encoding issues.

<?php

// Assumption: content is utf-8 encoded
$content = "<h1>This is a heading</h1><p>This is a paragraph</p>";

// Load content to a div and specify encoding with a meta tag
$temp_dom = new DOMDocument();
$temp_dom->loadHTML("<meta http-equiv='Content-Type' content='text/html; charset=utf-8' /><div>$content</div>");

// As loadHTML() adds a DOCTYPE as well as <html> and <body> tag, let’s create another DOMDocument and import just the nodes we want
$dom = new DOMDocument();
$first_div = $temp_dom->getElementsByTagName('div')[0];
$first_div_node = $dom->importNode($first_div, true);
$dom->appendChild($first_div_node);

// Do whatever you want to do
$dom->getElementsByTagName('h1')[0]->setAttribute('class', 'happy');

// You could also just echo $dom->saveHTML() if you don’t mind the div and whitespace
echo substr(trim($dom->saveHTML()), 5, -6);

// Outputs: <h1 class="happy">This is a heading</h1><p>This is a paragraph</p>
?>

tloach at gmail dot com ¶

16 years ago

For anyone else who has been having issues with formatOutput not working, here is a workaround.

Rather than just doing something like:

<?php
$outXML = $xml->saveXML();
?>

force it to reload the XML from scratch, then it will format correctly:

<?php
$outXML = $xml->saveXML();
$xml = new DOMDocument();
$xml->preserveWhiteSpace = false;
$xml->formatOutput = true;
$xml->loadXML($outXML);
$outXML = $xml->saveXML();
?>
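
If the XML is loaded from a string in the first place, the extra save-and-reload round trip may not be needed: setting preserveWhiteSpace to false before loading lets formatOutput re-indent the tree. A minimal sketch, with an illustrative input string:

```php
<?php
// Illustrative input with irregular whitespace between elements.
$source = "<root>\n<a>1</a>\n\n<b>2</b>\n</root>";

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false; // must be set before loading
$doc->formatOutput = true;
$doc->loadXML($source);

// The whitespace-only text nodes were dropped on load, so the output is re-indented.
$formatted = $doc->saveXML();
echo $formatted;
```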

biker dot mike at gmx dot com ¶

9 years ago

Look out for the following gotcha when loading XML from a string:

<?php
$doc = new \DOMDocument;
$doc->documentURI = $myXmlFilename;
$doc->loadXML($myXmlString);
?>

documentURI is now set to the value of $myXmlFilename, right?

Wrong!

It's set to the current working directory.  If you want to manually set documentURI to something other than the CWD, do so AFTER the call to loadXML().

E.g.:
<?php
$doc = new \DOMDocument;
$doc->loadXML($myXmlString);
$doc->documentURI = $myXmlFilename;
?>

documentURI really is now set to the value of $myXmlFilename.
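
The ordering can be checked with a short script along these lines. The URI is a hypothetical value, and the value observed right after loadXML() may differ between environments, so it is only printed here.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root/>');

// Whatever loadXML() assigned; the exact value depends on the environment.
$afterLoad = $doc->documentURI;

// Setting documentURI after loading makes the value stick.
$doc->documentURI = 'file:///tmp/example.xml'; // hypothetical location
$afterSet = $doc->documentURI;

var_dump($afterLoad, $afterSet);
```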

Nick M ¶

15 years ago

You may need to save all or part of a DOMDocument as an XHTML-friendly string, something compliant with both XML and HTML 4. Here's the DOMDocument class extended with a saveXHTML method:

<?php

/**
 * XHTML Document
 *
 * Represents an entire XHTML DOM document; serves as the root of the document tree.
 */
class XHTMLDocument extends DOMDocument {

  /**
   * These tags must always self-terminate. Anything else must never self-terminate.
   * 
   * @var array
   */
  public $selfTerminate = array(
      'area','base','basefont','br','col','frame','hr','img','input','link','meta','param'
  );
  
  /**
   * saveXHTML
   *
   * Dumps the internal XML tree back into an XHTML-friendly string.
   *
   * @param DOMNode $node
   *         Use this parameter to output only a specific node rather than the entire document.
   */
  public function saveXHTML(?DOMNode $node = null) {
    
    if (!$node) $node = $this->firstChild;
    
    $doc = new DOMDocument('1.0');
    $clone = $doc->importNode($node->cloneNode(false), true);
    $term = in_array(strtolower($clone->nodeName), $this->selfTerminate);
    $inner='';
    
    if (!$term) {
      $clone->appendChild(new DOMText(''));
      if ($node->childNodes) foreach ($node->childNodes as $child) {
        $inner .= $this->saveXHTML($child);
      }
    }
    
    $doc->appendChild($clone);
    $out = $doc->saveXML($clone);
    
    return $term ? substr($out, 0, -2) . ' />' : str_replace('><', ">$inner<", $out);

  }

}

?>

This hasn't been benchmarked, but is probably significantly slower than saveXML or saveHTML and should be used sparingly.

pastormontesinos at gmail dot com ¶

5 years ago

To handle script nodes safely when parsing, the best option is to extend DOMDocument: replace the script tags with placeholders while DOMDocument processes the markup, then restore them right after saveHTML() is called. Here is my custom class.

<?php 

class SafeDOMDocument extends \DOMDocument
{
    const REGEX_JS            = '#(\s*<!--(\[if[^\n]*>)?\s*(<script.*</script>)+\s*(<!\[endif\])?-->)|(\s*<script.*</script>)#isU';
    const SUBSTITUTION_FORMAT = '<!--<script class="script_%s"></script>-->';
    private $matchedScripts = [];

    public function loadHTML(string $source, int $options = 0): bool
    {
        $this->formatOutput        = false;
        $this->preserveWhiteSpace  = true;
        $this->validateOnParse     = false;
        $this->strictErrorChecking = false;
        $this->recover             = false;
        $this->resolveExternals    = false;
        $this->substituteEntities  = false;
        $matches = [];
        $success = preg_match_all(self::REGEX_JS, $source, $matches);

        if ($success && !empty($matches)) {
            foreach ($matches[0] as $match) {
                $storedScript = rtrim(ltrim($match, "\n\r\t "), "\n\r\t ");
                $scriptId = md5($storedScript);
                $key = sprintf(self::SUBSTITUTION_FORMAT, $scriptId);
                $source = str_replace($match, $key, $source);
                $this->matchedScripts[$key] = $storedScript;
            }
        }

        return parent::loadHTML($source, $options);
    }

    public function saveHTML(?\DOMNode $node = null): string|false
    {
        $output = parent::saveHTML($node);

        if (count($this->matchedScripts)) {
            foreach ($this->matchedScripts as $substitution => $originalSnippet) {
                $output = str_replace($substitution, $originalSnippet, $output);
            }
        }

        return $output;
    }
}
?>

fcartegnie ¶

16 years ago

Be careful with formatOutput.

Creating an empty node like this:
createElement('foo', '')
instead of:
createElement('foo')
will break the formatOutput formatting.

evert at er dot nl ¶

15 years ago

A nice and simple node-to-array function I wrote, worth a try ;)

<?php
function getArray($node)
{
    // Start from an empty array; attributes and children are added below
    $array = [];

    // Copy every attribute as name => value
    if ($node->hasAttributes())
    {
        foreach ($node->attributes as $attr)
        {
            $array[$attr->nodeName] = $attr->nodeValue;
        }
    }

    if ($node->hasChildNodes())
    {
        if ($node->childNodes->length == 1)
        {
            // Single child (usually a text node): store its value under its node name
            $array[$node->firstChild->nodeName] = $node->firstChild->nodeValue;
        }
        else
        {
            foreach ($node->childNodes as $childNode)
            {
                // Skip text nodes and recurse into everything else
                if ($childNode->nodeType != XML_TEXT_NODE)
                {
                    $array[$childNode->nodeName][] = getArray($childNode);
                }
            }
        }
    }

    return $array;
}
?>

devour at php dot net ¶

1 year ago

While DOMDocument can technically be used to parse HTML, it is not ideal for HTML documents and is better suited for processing well-formed XML. One of the primary issues with using DOMDocument for HTML is its strict handling of special characters, such as the ampersand (&).

DOMDocument expects ampersands to be escaped as &amp;, which is in line with XML standards but can be counterintuitive for real-world HTML, where raw & characters are common, especially in URLs and text. This behavior stems from the underlying libxml parser, which reports such constructs as parse errors.

This problem has been reported as far back as 2001, yet the same parsing errors continue to occur when using DOMDocument on HTML documents today.

A common workaround developers use is to suppress the error reporting from DOMDocument, particularly when parsing errors like unescaped ampersands occur. However, suppressing these errors is not recommended, especially in production environments, as it can hide important issues and pose potential security risks. Ignoring or suppressing errors can leave warnings unnoticed, which may result in vulnerabilities if not properly addressed.

For these reasons, it's advisable to use DOMDocument primarily for XML documents, or to consider more appropriate libraries when working with HTML to avoid these issues.

theCoder / MV
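
One alternative to suppressing errors outright is to collect them with libxml_use_internal_errors() and inspect or log them afterwards. The sketch below assumes a small HTML fragment with raw ampersands; the `$messages` array is only an illustration of where logging could hook in.

```php
<?php
// Illustrative markup with raw ampersands in text and in a URL.
$html = '<p>Fish & Chips <a href="/search?a=1&b=2">results</a></p>';

// Collect libxml errors instead of emitting warnings, then restore the setting.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Keep the messages so they can be logged or reviewed rather than silently lost.
$messages = [];
foreach ($errors as $error) {
    $messages[] = sprintf('libxml [%d] line %d: %s', $error->code, $error->line, trim($error->message));
}

$href = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');
echo $href, "\n";
```

Whether any messages are reported for this input depends on the libxml version in use, so the sketch does not rely on a particular count.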

cmyk777 at gmail dot com ¶

17 years ago

This function may help to debug current dom element:

<?php
function dom_dump($obj) {
    if ($classname = get_class($obj)) {
        $retval = "Instance of $classname, node list: \n";
        switch (true) {
            case ($obj instanceof DOMDocument):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->saveXML($obj);
                break;
            case ($obj instanceof DOMElement):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMAttr):
                $retval .= "XPath: {$obj->getNodePath()}\n".$obj->ownerDocument->saveXML($obj);
                //$retval .= $obj->ownerDocument->saveXML($obj);
                break;
            case ($obj instanceof DOMNodeList):
                for ($i = 0; $i < $obj->length; $i++) {
                    $retval .= "Item #$i, XPath: {$obj->item($i)->getNodePath()}\n".
"{$obj->item($i)->ownerDocument->saveXML($obj->item($i))}\n";
                }
                break;
            default:
                return "Instance of unknown class";
        }
    } else {
        return 'no elements...';
    }
    return htmlspecialchars($retval);
}
?>

Example usage:

<?php
$dom = new DomDocument();
$dom->load('test.xml');
$body = $dom->documentElement->getElementsByTagName('book');
echo '<pre>'.dom_dump($body).'</pre>';
?>

Output:

Instance of DOMNodeList, node list: 
Item #0, XPath: /library/book[1]
<book isbn="0345342968">
<title>Fahrenheit 451</title>
<author>R. Bradbury</author>
<publisher>Del Rey</publisher>
</book>
Item #1, XPath: /library/book[2]
<book isbn="0048231398">
<title>The Silmarillion</title>
<author>J.R.R. Tolkien</author>
<publisher>G. Allen &amp; Unwin</publisher>
</book>
Item #2, XPath: /library/book[3]
<book isbn="0451524934">
<title>1984</title>
<author>G. Orwell</author>
<publisher>Signet</publisher>
</book>
Item #3, XPath: /library/book[4]
<book isbn="031219126X">
<title>Frankenstein</title>
<author>M. Shelley</author>
<publisher>Bedford</publisher>
</book>
Item #4, XPath: /library/book[5]
<book isbn="0312863551">
<title>The Moon Is a Harsh Mistress</title>
<author>R. A. Heinlein</author>
<publisher>Orb</publisher>
</book>

sites.sitesbr.net ¶

13 years ago

How to objectify a DOMDocument with a hierarchy like:
<root>
    <item>
          <prop1>info1</prop1>
          <prop2>info2</prop2>
          <prop3>info3</prop3>
     </item>
    <item>
          <prop1>info1</prop1>
          <prop2>info2</prop2>
          <prop3>info3</prop3>
     </item>
</root>

The result can then be used in object style to retrieve information, e.g.:

<?php
     $theNodeValue = $aitem->prop1;
?>

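Here is the code: one class and two functions.

<?php
 // Properties are added dynamically below, which PHP 8.2+ deprecates unless the class opts in
 #[\AllowDynamicProperties]
 class ArrayNode{
       public $nodeName, $nodeValue;
 }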

 function getChildNodeElements( $domNode ){
     $nodes = array();
     for( $i=0; $i < $domNode->childNodes->length; $i++){
       $cn = $domNode->childNodes->item($i);
       if( $cn->nodeType == 1){
           $nodes[] = $cn;
           }
     }
    return $nodes;
 }

 function getArrayNodes( $domDoc ){
     $res = array();

       for( $i=0; $i < $domDoc->childNodes->length; $i++){
       $cn = $domDoc->childNodes->item($i);
       # The first is the root tag...
          if( $cn->nodeType == 1){
               # But we want its childNodes.
                $sub_cn = getChildNodeElements( $cn);
                # Found the tagName:
                $baseItemTagName = $sub_cn[0]->nodeName;
                break;
            }
        }

       $dnl = $domDoc->getElementsByTagName( $baseItemTagName);

       for( $i=0; $i< $dnl->length; $i++){
          $arrayNode = new ArrayNode();

      # Summary
      $arrayNode->nodeName = $dnl->item($i)->nodeName;
      $arrayNode->nodeValue = $dnl->item($i)->nodeValue;

      # Child Nodes
      $cn = $dnl->item($i)->childNodes;
      for( $k=0; $k<$cn->length; $k++){
           if( $cn->item($k)->nodeName == "#text" && trim($cn->item($k)->nodeValue) == "") continue;
           $arrayNode->{$cn->item($k)->nodeName} = $cn->item($k)->nodeValue;
      }

      # Attributes
      $attr = $dnl->item($i)->attributes;
      for( $k=0; $k < $attr->length; $k++){
           if(! is_null($attr)){
            if( $attr->item($k)->nodeName == "#text" && trim($attr->item($k)->nodeValue) == "") continue;
            $arrayNode->{$attr->item($k)->nodeName} = $attr->item($k)->nodeValue;
           }
      }

      $res[] = $arrayNode;

       }

     return $res;
 }
?>

To use it:

<?php

  # First you load a XML in a DomDocument variable.

   $url = "/path/to/yourxmlfile.xml";
   $domSrc = file_get_contents($url);
   $dom = new DomDocument();
   $dom->loadXML( $domSrc );

  # Then, you get the ArrayNodes from the DomDocument.

    $ans = getArrayNodes( $dom );

 
    for( $i=0; $i < count( $ans ) ; $i++){

    $cn =  $ans[ $i];

    $info1 =  $cn->prop1;
    $info2 =  $cn->prop2;
    $info3 =  $cn->prop3;
      
         // ...
 
   }

?>

610010559 at qq dot com ¶

4 years ago

When you add a new element to already-formatted XML through appendChild(), you will find that the new element is not formatted (no indentation, no line break). Here is my solution: in short, load the XML without preserving white space. Example:
<?php
$doc = new \DOMDocument();
$doc->formatOutput = true;
$doc->preserveWhiteSpace = false; // this is the key; the default value is true
$doc->loadXML($xmlStr);
$doc->documentElement->appendChild($doc->createElement('php', '666'));
$formattedXMLStr = $doc->saveXML(); // DOMDocument will format the XML string for you
echo $formattedXMLStr;
?>
It took me some time to figure this out. I hope it saves you time.

ashjkshdu283 at gmail dot com ¶

8 years ago

<?php

/* Function evolved from the post by jay at jaygilford dot com.
 * It returns an array of the values of the specified attribute ($attr)
 * across all elements of the DOMDocument object.
 */
function getAttrData(string $attr, DOMDocument $dom) {
    // Empty array to hold all classes to return 
    $attrData = array(); 

    // Loop through each tag in the DOM and add its attribute data to the array
    foreach($dom->getElementsByTagName('*') as $tag) {
        if(empty($tag->getAttribute($attr)) === false) {
            array_push($attrData, $tag->getAttribute($attr));
        }
    } 

    //Return the array of attribute data
    return array_unique($attrData); 
}

$html = '
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<a href="#someLink" id="someLink" class="link-class">Some Link</a>
<a href="#someOtherLink" id="someOtherLink" class="link-class">Some Other Link</a>
<h1 id="header1" class="header-class">My First Heading</h1>
<p id="para1" class="para-class">My first paragraph.</p>
</body>
</html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
var_dump(getAttrData('class', $dom));
?>

ingjetel at gmail dot com ¶

11 years ago

Easy function for basic output of XML file via DOM parsing

<?php
$dom = new DomDocument();
$dom->load("./file.xml") or die("error");
$start = $dom->documentElement;
fc($start);

function fc($node) {
  $child = $node->childNodes;
  foreach($child as $item) {
    if ($item->nodeType == XML_TEXT_NODE) {
      if (strlen(trim($item->nodeValue))) echo trim($item->nodeValue)."<br/>";
    }
    else if ($item->nodeType == XML_ELEMENT_NODE) fc($item);
  }
}
?>

admin at beerpla dot net ¶

16 years ago

After seeing many complaints about certain DOMDocument shortcomings, such as bad handling of encodings and always saving HTML fragments with <html>, <head>, and DOCTYPE, I decided that a better solution is needed.

So here it is: SmartDOMDocument. You can find it at http://beerpla.net/projects/smartdomdocument/

Currently, the main highlights are:

- SmartDOMDocument inherits from DOMDocument, so it's very easy to use - just declare an object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior on top of all existing functionality (see example below).

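- saveHTMLExact() - DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (at the time of writing there were no flags to turn this behavior off; the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options used in another note on this page were added later).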
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want - it saves HTML without adding that extra garbage that DOMDocument does.

- encoding fix - DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal with encoding correctly. This behavior is transparent to you - just use loadHTML() as you would normally.

- SmartDOMDocument Object As String - you can use a SmartDOMDocument object as a string which will print out its contents.
For example:
<?php
echo "Here is the HTML: $smart_dom_doc";
?>

I'm going to maintain this code and try to fix bugs as they come in.

Enjoy.

danny dot nunez15 at gmail dot com ¶

12 years ago

A simple function to grab all links in a page.

<?php
function get_links($url) {

    // Create a new DOM document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the URL's contents into the DOM
    $xml->loadHTMLFile($url);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the DOM and add its non-empty href to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (!empty($href)) {
            $links[] = $href;
        }
    }

    // Return the links
    return $links;
}
?>

qrworld.net ¶

11 years ago

In this post http://softontherocks.blogspot.com/2014/11/descargar-el-contenido-de-una-url_11.html I found a simple way to get the content of a URL with DOMDocument, loadHTMLFile and saveHTML().

function getURLContent($url){
    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = FALSE;
    @$doc->loadHTMLFile($url);
    return $doc->saveHTML();
}
