Getting the Inner Text of an HTML Element in .NET Without Extra Whitespace

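The MSHTML library lets .NET code work with the HTML document tree quickly. However, when IHTMLElement.innerText is used to read the rendered text of an element, the result may contain extra spaces at the start of paragraphs. In that case, read the IHTMLElement.innerHTML property instead to get the text as HTML markup, then parse it yourself.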

The sample code below requires references to the COM component Microsoft HTML Object Library and to the assembly System.Web.

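Imports mshtml
Imports System.Text.RegularExpressions
Imports System.Web

Private Function GetInnerTextFromTextContainer() As String
    'Get the MSHTML document from the WebBrowser control.
    'This assumes wbbWebContainer.Document returns the MSHTML object itself (as the WPF WebBrowser does);
    'with the Windows Forms WebBrowser, use wbbWebContainer.Document.DomDocument instead.
    Dim CurrentHTMLDocument As HTMLDocument = TryCast(wbbWebContainer.Document, HTMLDocument)
    If CurrentHTMLDocument Is Nothing Then Return ""

    'Find the first div element whose class attribute is exactly "text-container"
    Dim PossibleElements As IHTMLElementCollection = CurrentHTMLDocument.getElementsByTagName("div")
    For Each CurrentElement As IHTMLElement In PossibleElements
        If CurrentElement.className = "text-container" Then
            'CurrentElement.innerText may contain unneeded leading white space, so start from innerHTML
            '(guarded here because innerHTML may be Nothing for an empty element)
            Dim PageSourceHTML As String = If(CurrentElement.innerHTML, "")

            'Remove <br> tags in any spelling (<br>, <BR>, <br />); a plain Replace("<br>", "") is case-sensitive
            PageSourceHTML = Regex.Replace(PageSourceHTML, "<br\s*/?>", "", RegexOptions.IgnoreCase).Trim()

            'Decode HTML entities such as &amp; and &nbsp;
            Dim PageSourceText As String = HttpUtility.HtmlDecode(PageSourceHTML)

            'Normalize all line breaks to the platform default
            PageSourceText = PageSourceText.Replace(vbCrLf, vbLf)
            PageSourceText = PageSourceText.Replace(vbCr, vbLf)
            PageSourceText = PageSourceText.Replace(vbLf, Environment.NewLine)

            'Replace non-breaking spaces (U+00A0) with ordinary spaces
            PageSourceText = PageSourceText.Replace(ChrW(&HA0), " ")
            Return PageSourceText
        End If
    Next

    'No matching element was found
    Return ""
End Function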
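
The text cleanup itself does not depend on MSHTML, so it can be tried out on its own. The following is only a minimal sketch of that step as a console program: the module name InnerHtmlCleanupSketch, the function HtmlFragmentToText, and the sample string are hypothetical names made up for illustration, and System.Net.WebUtility.HtmlDecode is used in place of HttpUtility.HtmlDecode so that no reference to System.Web is needed.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module InnerHtmlCleanupSketch
    ' Hypothetical helper (not part of MSHTML): turns an innerHTML fragment into plain text.
    Public Function HtmlFragmentToText(fragment As String) As String
        If String.IsNullOrEmpty(fragment) Then Return ""

        ' Drop <br>, <BR>, <br/> and <br /> tags, then trim surrounding white space.
        ' Line breaks that are already present in the markup are kept.
        Dim result As String = Regex.Replace(fragment, "<br\s*/?>", "", RegexOptions.IgnoreCase).Trim()

        ' Decode entities such as &amp; and &nbsp; (assumption: WebUtility instead of HttpUtility).
        result = WebUtility.HtmlDecode(result)

        ' Normalize every line-break style to the platform default.
        result = result.Replace(vbCrLf, vbLf).Replace(vbCr, vbLf).Replace(vbLf, Environment.NewLine)

        ' Replace non-breaking spaces (U+00A0) with ordinary spaces.
        Return result.Replace(ChrW(&HA0), " ")
    End Function

    Sub Main()
        ' Made-up sample: leading/trailing spaces, mixed-case <br> tags, and two entities.
        Dim sample As String = "  First&nbsp;line<BR>" & vbLf & "Tom &amp; Jerry<br />  "
        Console.WriteLine(HtmlFragmentToText(sample))
    End Sub
End Module
```

For the sample string, the program should print "First line" and "Tom & Jerry" on two separate lines. Keeping the cleanup in a function that takes a plain String makes it possible to check the result without a WebBrowser control or a loaded page.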

References:

https://learn.microsoft.com/zh-tw/dotnet/api/system.web.httputility.htmldecode?view=net-8.0

Unless otherwise stated, the content of this page is licensed under the Creative Commons Attribution-ShareAlike 3.0 License