Here’s a snippet for you. I was extracting SharePoint profile information and needed to display the “About Me” field in an application that couldn’t render HTML, so I needed to convert the HTML to text. This is a bit more complex than simply stripping the HTML tags out of the content. Doing that would preserve whitespace that should be ignored and would also ignore whitespace that is indicated by things like br, P, H1 tags, etc.
function Html-ToText {
param([System.String] $html)
# remove line breaks, replace with spaces
$html = $html -replace "(`r|`n|`t)", " "
# write-verbose "removed line breaks: `n`n$html`n"
# remove invisible content
@('head', 'style', 'script', 'object', 'embed', 'applet', 'noframes', 'noscript', 'noembed') | % {
$html = $html -replace "<$_[^>]*?>.*?</$_>", ""
}
# write-verbose "removed invisible blocks: `n`n$html`n"
# Condense extra whitespace
$html = $html -replace "( )+", " "
# write-verbose "condensed whitespace: `n`n$html`n"
# Add line breaks
@('div','p','blockquote','h[1-9]') | % { $html = $html -replace "</?$_[^>]*?>.*?</$_>", ("`n" + '$0' )}
# Add line breaks for self-closing tags
@('div','p','blockquote','h[1-9]','br') | % { $html = $html -replace "<$_[^>]*?/>", ('$0' + "`n")}
# write-verbose "added line breaks: `n`n$html`n"
#strip tags
$html = $html -replace "<[^>]*?>", ""
# write-verbose "removed tags: `n`n$html`n"
# replace common entities
@(
@("&bull;", " * "),
@("&lsaquo;", "<"),
@("&rsaquo;", ">"),
@("&(rsquo|lsquo);", "'"),
@("&(quot|ldquo|rdquo);", '"'),
@("&trade;", "(tm)"),
@("&frasl;", "/"),
@("&(quot|#34|#034|#x22);", '"'),
@('&(amp|#38|#038|#x26);', "&"),
@("&(lt|#60|#060|#x3c);", "<"),
@("&(gt|#62|#062|#x3e);", ">"),
@('&(copy|#169);', "(c)"),
@("&(reg|#174);", "(r)"),
@("&nbsp;", " "),
@("&(.{2,6});", "")
) | % { $html = $html -replace $_[0], $_[1] }
# write-verbose "replaced entities: `n`n$html`n"
return $html
}
And you can run it like this:
Html-ToText (new-object net.webclient).DownloadString("<a href="http://winstonfassett.com">http://winstonfassett.com</a>")
[...] den spitzen Klammern durch Leerstring zu ersetzen… Es gibt nichts, was es nicht gibt winstonfassett.com/blog | HTML to Text conversion in PowerShell Gute Arbeit Marcus, iss denke ich viel bei rum gekommen. [...]