Detecting and removing byte order marks (BOMs) from Unicode text files

Some Unicode-encoded text files begin with a byte order mark (BOM).

Detecting and removing this mark before inputting the file to Matrix is recommended.

There are multiple ways of detecting and removing a BOM from a Unicode-encoded text file.

The exact steps depend on the host operating system and the end-user applications in use.

The procedures below use two Unix-native shell utilities, file and sed.

These utilities are included by default with macOS and virtually all Linux distributions.

These utilities are also included with the Windows Subsystem for Linux, which is freely available from Microsoft.

Byte order marks

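UTF-16 and UTF-32 are byte-order dependent when text is serialized. The Unicode specification therefore supports, but does not require, a byte order mark (BOM) at the beginning of a text file. UTF-8 has no byte-order dependency, but a UTF-8 file may still begin with the BOM, where it acts as an encoding signature.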

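The BOM is a single code point, U+FEFF. Depending on the byte order, a reader that interprets the first bytes of the file sees one of two values:

  • U+FEFF zero width no-break space (byte order mark), which means the byte order assumed by the reader matches the file; or

  • U+FFFE (a noncharacter), which means the bytes are in the opposite order

These two values are byte-swapped images of each other and, at the beginning of a file, they declare the byte ordering used in the file.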
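To make the byte patterns concrete, the sketch below prints how U+FEFF is serialized in UTF-8 and in the two UTF-16 byte orders. It is only an illustration: the octal escapes are the BOM bytes written in a form that POSIX printf accepts, and the exact spacing of the od output varies between implementations.

```shell
# Print the bytes that U+FEFF serializes to in three encoding forms.
# Octal escapes are used because \x escapes are not guaranteed by POSIX printf.
printf 'UTF-8:     '; printf '\357\273\277' | od -An -tx1
printf 'UTF-16 BE: '; printf '\376\377' | od -An -tx1
printf 'UTF-16 LE: '; printf '\377\376' | od -An -tx1
```

The two UTF-16 lines show the same two bytes in opposite order, which is how a reader can tell the byte ordering of the file.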

As noted, however, a BOM is not required. Many applications do not write a BOM when saving Unicode-encoded text files, and many do not expect one when parsing them.

In this regard, Matrix is in line with most applications.

Consequently, when a Unicode-encoded text file with a BOM is parsed by Matrix, the BOM string can affect the input.

For example, when a CSV file with a BOM is parsed, the first column of values may be either ignored or presented as empty values.
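The following sketch suggests why a parser that does not expect a BOM can mishandle the first column. It builds two small CSV files (the file names and contents are hypothetical) and counts the bytes in the first header field of each. It does not reproduce how Matrix parses input; it only shows that the BOM bytes become part of the first field.

```shell
# Build two small CSV files in a temporary directory (names are hypothetical).
tmpdir=$(mktemp -d)
printf 'id,name\n1,alpha\n' > "$tmpdir/plain.csv"
{ printf '\357\273\277'; cat "$tmpdir/plain.csv"; } > "$tmpdir/bom.csv"

# Extract the first header field from each file and count its bytes.
# The arithmetic expansion strips any padding that wc adds.
plain_len=$(( $(head -n 1 "$tmpdir/plain.csv" | cut -d, -f1 | tr -d '\n' | wc -c) ))
bom_len=$(( $(head -n 1 "$tmpdir/bom.csv" | cut -d, -f1 | tr -d '\n' | wc -c) ))

echo "first header field without BOM: $plain_len bytes"
echo "first header field with BOM:    $bom_len bytes"
rm -r "$tmpdir"
```

The field in the second file is three bytes longer, so a lookup for a column named id would not match it exactly.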

Detecting BOMs

A BOM is normally not displayed when a Unicode-encoded text file is opened in a text editor or text viewer, or when the file is sent to standard output in a terminal.

For example, the visible output of the following command is the same whether the file has a BOM or not:

cat unicode-example-with-or-without-a-bom.csv

Instead, to check whether a BOM is present, use the file -k command.

If a file has a BOM, this fact is part of the data written to standard output.

file -k unicode-example-with-bom.csv
unicode-example-with-bom.csv: CSV text
- , Unicode text, UTF-8 (with BOM) text, with CRLF line terminators

If a file does not have a BOM, the output simply omits any mention of one.

file -k unicode-example-with-no-bom.md
unicode-example-with-no-bom.md: Unicode text, UTF-8 text
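Where a scriptable yes-or-no answer is more convenient than reading the file -k output, the first three bytes of the file can be inspected directly. The sketch below is one possible approach, not part of the documented procedure: has_utf8_bom is a hypothetical helper name, it checks only for the UTF-8 BOM, and it assumes that head -c is available.

```shell
# has_utf8_bom FILE: succeed (exit 0) if FILE starts with the bytes EF BB BF.
# The function name is hypothetical; head -c is assumed to be available.
has_utf8_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d '[:space:]')" = "efbbbf" ]
}

# Usage with the file names from the examples above.
for f in unicode-example-with-bom.csv unicode-example-with-no-bom.md; do
  [ -e "$f" ] || continue            # skip files that are not present
  if has_utf8_bom "$f"; then
    echo "$f: UTF-8 BOM found"
  else
    echo "$f: no UTF-8 BOM"
  fi
done
```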

Removing BOMs

To remove a BOM from a file, example.text, edit the file with sed and write the result to a copy, example-no-bom.text, as follows.

sed '1s/^\xEF\xBB\xBF//' example.text > example-no-bom.text

To document the above command:

1

Make the following edits on the first line of the file only.

s/^\xEF\xBB\xBF//

Find the byte string 0xef 0xbb 0xbf, written in hexadecimal, at the beginning of the line (^) and replace it with an empty string.

0xef 0xbb 0xbf is the UTF-8 encoding of U+FEFF, the BOM code point noted previously.

example.text > example-no-bom.text

Read in the file example.text and write the changes out to a new file, example-no-bom.text.
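The \xHH escapes in the sed expression are an extension supported by GNU sed; other sed implementations, such as the BSD sed shipped with macOS, may not interpret them. As a hedge against that, the sketch below removes a UTF-8 BOM without sed. It is an alternative, not the documented method: strip_utf8_bom is a hypothetical helper name, and the approach handles the UTF-8 BOM only.

```shell
# strip_utf8_bom SRC DEST: copy SRC to DEST, dropping a leading UTF-8 BOM if present.
# The function name is hypothetical; it uses only head, od, tr, tail, and cat.
strip_utf8_bom() {
  first3=$(head -c 3 "$1" | od -An -tx1 | tr -d '[:space:]')
  if [ "$first3" = "efbbbf" ]; then
    tail -c +4 "$1" > "$2"    # start output at byte 4, skipping the 3 BOM bytes
  else
    cat "$1" > "$2"           # no BOM: copy the file unchanged
  fi
}

# Usage, assuming example.text exists in the current directory.
if [ -e example.text ]; then
  strip_utf8_bom example.text example-no-bom.text
fi
```

Checking the first three bytes before calling tail avoids truncating a file that has no BOM.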

Check the BOM is gone

To check that the BOM has been removed, run file -k on the edited file.

The output should resemble the earlier no-BOM example: the encoding is reported as UTF-8 text, without the (with BOM) qualifier.

file -k example-no-bom.text
example-no-bom.text: CSV text
- , Unicode text, UTF-8 text, with CRLF line terminators
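As an additional check, the edited file can be compared against the original with its first three bytes skipped, to confirm that nothing other than the BOM changed. The sketch below is one way to do this; bom_only_diff is a hypothetical helper name, and the check assumes the original file did start with a three-byte UTF-8 BOM.

```shell
# bom_only_diff ORIGINAL STRIPPED: succeed (exit 0) if STRIPPED is identical
# to ORIGINAL with its first 3 bytes removed. The function name is hypothetical.
bom_only_diff() {
  tail -c +4 "$1" | cmp -s - "$2"
}

# Usage, assuming both files from the steps above exist.
if [ -e example.text ] && [ -e example-no-bom.text ]; then
  if bom_only_diff example.text example-no-bom.text; then
    echo "only the BOM was removed"
  else
    echo "files differ beyond the BOM"
  fi
fi
```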