Encoding XML in UTF-8 with .NET
December 30, 2011
The solution described here was inspired by the blog post found at http://rlacovara.blogspot.com/2011/02/how-to-create-xml-in-c-with-utf-8.html, which explains how to replace the default UTF-16 encoding with UTF-8. I have implemented a variation of it. A more generic solution is available at http://www.experts-exchange.com/Programming/Languages/C_Sharp/Q_20554526.html. That one, which I have not implemented, lets the caller choose the encoding of the output.
By default, XML documents serialized to a string using C# and the .NET XmlSerializer class (with a StringWriter as the target) are declared as UTF-16. I recently needed to change this to the more commonly used UTF-8, and learned a few things along the way.
The first thing that I discovered (and perhaps should have already known) is that .NET stores all strings internally as UTF-16. That is why, if you don’t change the default encoding, a StringWriter reports UTF-16 and the XML is produced with a UTF-16 declaration.
Next, I found that the Encoding property of the StringWriter class is read-only, so you can interrogate the default encoding (and see that it is in fact UTF-16) but cannot change it.
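The default behavior can be seen with a short sketch like the one below. The Person type and the DefaultEncodingDemo class are hypothetical names made up for illustration; only StringWriter and XmlSerializer come from the framework.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical type, used only for illustration.
public class Person
{
    public string Name { get; set; }
}

public static class DefaultEncodingDemo
{
    // Serializes to a plain StringWriter and returns the resulting XML text.
    public static string SerializeDefault(Person person)
    {
        var serializer = new XmlSerializer(typeof(Person));
        using (var writer = new StringWriter())
        {
            // writer.Encoding = System.Text.Encoding.UTF8;
            // The line above would not compile: Encoding has no setter.
            serializer.Serialize(writer, person);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Reports the encoding name the writer advertises (UTF-16 by default).
        Console.WriteLine(new StringWriter().Encoding.WebName);

        // The XML declaration carries the same encoding name.
        Console.WriteLine(SerializeDefault(new Person { Name = "Ada" }));
    }
}
```

The serializer takes the encoding name for the XML declaration from the writer it is given, which is why the read-only property matters here.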
As I learned from the blog posts that I referenced above, the solution to changing the default UTF-16 encoding is to subclass the native .NET StringWriter class and override the default Encoding property value.
Following is a solution for producing a UTF-8-encoded XML document. The “StringWriterUtf8” class is the key to the solution. It inherits from the native System.IO.StringWriter class and overrides the Encoding property, returning Encoding.UTF8 instead of the default UTF-16 encoding (which .NET exposes as Encoding.Unicode; there is no Encoding.UTF16). Using an instance of this class as the target for the XML serialization output produces UTF-8 output.
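A minimal sketch of this approach might look like the following. StringWriterUtf8 is the class name used above; the Person type, the Utf8XmlDemo class and the SerializeUtf8 method are hypothetical and exist only to make the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// A StringWriter that reports UTF-8 instead of the default UTF-16.
public sealed class StringWriterUtf8 : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

// Hypothetical type, used only for illustration.
public class Person
{
    public string Name { get; set; }
}

public static class Utf8XmlDemo
{
    // Serializes the object using the UTF-8 writer and returns the XML text.
    public static string SerializeUtf8(Person person)
    {
        var serializer = new XmlSerializer(typeof(Person));
        using (var writer = new StringWriterUtf8())
        {
            serializer.Serialize(writer, person);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // The XML declaration now names utf-8 instead of utf-16.
        Console.WriteLine(SerializeUtf8(new Person { Name = "Ada" }));
    }
}
```

One thing to keep in mind: the returned string is still held in memory as UTF-16, like every .NET string. The override only changes the encoding named in the XML declaration, so the text should be written to its file or stream with Encoding.UTF8 for the bytes to match the declaration.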