Detecting and removing byte order marks (BOMs) from Unicode text files
Some Unicode-encoded text files begin with a byte order mark (BOM).
Detecting and removing this mark before inputting the file to Matrix is recommended.
There are multiple ways of detecting and removing a BOM from a Unicode-encoded text file. The specific steps depend on the host operating system and the end-user applications used. The procedures below use two Unix-native shell utilities, file and sed. These utilities are included by default with macOS and virtually all Linux distributions. They are also included with the Windows Subsystem for Linux, which is freely available from Microsoft.
Byte order marks
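UTF-16 and UTF-32 serialize each code unit as more than one byte, so the byte order matters. The Unicode specification supports, but does not require, a byte order mark (BOM) at the beginning of a text file.
When set, the BOM is a single code point, U+FEFF. Depending on the byte order, a reader sees one of two values:
- U+FEFF zero width no-break space (byte order mark): the reader's byte order matches the file's; or
- U+FFFE (a noncharacter): the bytes are in the opposite order and must be swapped.
These two values are byte-swapped mirror images of each other, so a BOM at the beginning of a file declares the byte ordering used in the file. UTF-8 has no byte-order ambiguity, but a UTF-8 file can still begin with U+FEFF, encoded as the three bytes 0xef 0xbb 0xbf.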
As noted, however, BOMs are not a required string. When Unicode-encoded text files are written out, most applications do not set a BOM string. And most applications do not expect the BOM string when parsing a Unicode-encoded text file.
In this regard, Matrix is in line with most applications.
Consequently, when a Unicode-encoded text file with a BOM is parsed by Matrix, the BOM string can affect the input.
For example, when a CSV file with a BOM is parsed, the first column of values may be either ignored or presented as empty values.
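The following is a minimal sketch of why a BOM can disturb the first column. It does not reproduce Matrix's parser; it only shows that the first header field of a file with a BOM carries three extra, invisible bytes and so no longer equals the expected column name. The file names and CSV contents are hypothetical sample data.

```shell
#!/bin/sh
# Hypothetical sample data: two small CSV files that differ only in the
# UTF-8 BOM bytes (EF BB BF, written here as octal escapes).
workdir=$(mktemp -d)
printf '\357\273\277id,name\n1,alpha\n' > "$workdir/with-bom.csv"
printf 'id,name\n1,alpha\n' > "$workdir/no-bom.csv"

# Extract the first header field from each file.
first_with_bom=$(head -n 1 "$workdir/with-bom.csv" | cut -d, -f1)
first_no_bom=$(head -n 1 "$workdir/no-bom.csv" | cut -d, -f1)

# Count the bytes in each field: the BOM adds three invisible bytes.
bytes_with_bom=$(printf '%s' "$first_with_bom" | wc -c | tr -d '[:space:]')
bytes_no_bom=$(printf '%s' "$first_no_bom" | wc -c | tr -d '[:space:]')
echo "with BOM: $bytes_with_bom bytes; without BOM: $bytes_no_bom bytes"
# prints: with BOM: 5 bytes; without BOM: 2 bytes

# A plain string comparison against the expected column name fails.
if [ "$first_with_bom" = "id" ]; then
  echo "header matches 'id'"
else
  echo "header does not match 'id'"
fi

rm -r "$workdir"
```

A parser that looks up columns by name would, under these assumptions, fail to find the first column even though the header looks correct on screen.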
Detecting BOMs
BOMs are typically not displayed when a Unicode-encoded text file is opened in a text editor or text viewer, or when the file is sent to standard output.
For example, the visible output of the following command is the same whether the file has a BOM or not:
cat unicode-example-with-or-without-a-bom.csv
Instead, to check whether a BOM is present, use the file -k command.
If a file has a BOM, the data written to standard output says so.
file -k unicode-example-with-bom.csv
unicode-example-with-bom.csv: CSV text
- , Unicode text, UTF-8 (with BOM) text, with CRLF line terminators
If a file does not have a BOM, the data written to standard output omits the (with BOM) note.
file -k unicode-example-with-no-bom.md
unicode-example-with-no-bom.md: Unicode text, UTF-8 text
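As a cross-check that does not depend on how file words its output, the leading bytes can be inspected directly. The sketch below is an assumption-laden alternative, not part of the procedure above: bom_kind is a hypothetical helper name, and it recognizes only the UTF-8 and UTF-16 BOMs (a UTF-32LE file, which starts with FF FE 00 00, would be reported as UTF-16LE).

```shell
#!/bin/sh
# bom_kind FILE: print which BOM, if any, starts FILE (hypothetical helper).
bom_kind() {
  # Dump the first three bytes as hex and strip od's whitespace.
  lead=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
  case "$lead" in
    efbbbf) echo "UTF-8" ;;     # EF BB BF: U+FEFF encoded as UTF-8
    fffe*)  echo "UTF-16LE" ;;  # FF FE: U+FEFF in little-endian order
    feff*)  echo "UTF-16BE" ;;  # FE FF: U+FEFF in big-endian order
    *)      echo "none" ;;
  esac
}

# Hypothetical sample files; only the leading bytes matter here.
workdir=$(mktemp -d)
printf '\357\273\277id,name\n' > "$workdir/utf8-bom.csv"
printf '\377\376' > "$workdir/utf16le-bom.txt"
printf 'id,name\n' > "$workdir/no-bom.csv"

bom_kind "$workdir/utf8-bom.csv"     # prints: UTF-8
bom_kind "$workdir/utf16le-bom.txt"  # prints: UTF-16LE
bom_kind "$workdir/no-bom.csv"       # prints: none

rm -r "$workdir"
```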
Removing BOMs
To remove a BOM from a file, example.text, edit the file with sed and write out a copy of the file, example-no-bom.text, as follows.
sed '1s/^\xEF\xBB\xBF//' example.text > example-no-bom.text
To document the above command:
- 1: Make the following edits on the first line of the file only.
- s/^\xEF\xBB\xBF//: Find the byte string 0xef 0xbb 0xbf at the beginning of the line (^) and replace it with an empty string. 0xef 0xbb 0xbf is the UTF-8 encoding of U+FEFF, the BOM code point noted previously. The \xHH escapes are a GNU sed extension; other sed implementations may not interpret them.
- example.text > example-no-bom.text: Read in the file example.text and write the changes out to a new file, example-no-bom.text.
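Where the \x escapes in the sed expression are not interpreted, the same result can be sketched with od and tail alone. This is an alternative under stated assumptions, not the documented procedure: strip_utf8_bom is a hypothetical helper name, it handles only the UTF-8 BOM, and it copies the file unchanged when no BOM is found.

```shell
#!/bin/sh
# strip_utf8_bom IN OUT: copy IN to OUT without a leading UTF-8 BOM
# (hypothetical helper; IN is never modified).
strip_utf8_bom() {
  # Read the first three bytes as hex and strip od's whitespace.
  lead=$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')
  if [ "$lead" = "efbbbf" ]; then
    # tail -c +4 starts output at byte 4, skipping the 3-byte BOM.
    tail -c +4 "$1" > "$2"
  else
    # No BOM: write an unchanged copy.
    cat "$1" > "$2"
  fi
}

# Hypothetical sample file standing in for example.text.
workdir=$(mktemp -d)
printf '\357\273\277id,name\n1,alpha\n' > "$workdir/example.text"

strip_utf8_bom "$workdir/example.text" "$workdir/example-no-bom.text"

# Show the first bytes of the result; they should no longer be ef bb bf.
od -An -tx1 -N3 "$workdir/example-no-bom.text"

rm -r "$workdir"
```

Checking for the BOM first makes the helper safe to run on files that never had one, at the cost of reading the start of the file twice.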