Accessing single characters in multibyte strings

Tue Jul 12, 2022

During a recent project, I needed to grab the first character of a string to build an A-Z array to organise the strings in alphabetical order. Typically, when tasked with getting the first character of a string I'd only need to do something like:

<?php
$str = 'Island of Ireland';
$first = substr($str, 0, 1);
# Alternatively
# $first = $str[0];
echo $first; // prints I

This works perfectly well for English text, where every character is a single byte. However, if you're working with a language like Danish, which has multibyte characters such as Ø (two bytes in UTF-8), trying to get the first character with the method above fails miserably. It returns only the first byte of the character, and outputs something like:

Your output contains characters that could not be displayed. This might have caused your empty result. Make sure you use utf8_encode around your output when working with special characters or binary data.
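To see why the single-byte approach breaks, it can help to look at the raw bytes. This is a minimal sketch, assuming the string is UTF-8 encoded and the mbstring extension is loaded; the variable names are only for illustration:

```php
<?php
$str = 'Øen Irland';

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($str);             // 11, because Ø takes two bytes in UTF-8
$charLength = mb_strlen($str, 'UTF-8'); // 10

// substr() cuts after the first byte, leaving half of a two-byte sequence.
$firstByte    = substr($str, 0, 1);
$firstByteHex = bin2hex($firstByte);                    // "c3"
$isValidUtf8  = mb_check_encoding($firstByte, 'UTF-8'); // false

var_dump($byteLength, $charLength, $firstByteHex, $isValidUtf8);
```

The lone `c3` byte is not a complete UTF-8 character, which is why it cannot be displayed.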

How do we get special characters from a string?

To get this character, we have to use the mb_ functions provided by PHP's mbstring extension. The code above, adapted for UTF-8 special characters, becomes:

<?php
$str = 'Øen Irland';
$first = mb_substr($str, 0, 1);
echo $first; // prints Ø
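For an A-Z index like the one mentioned at the start, the same call could be used inside a loop. The sketch below is one possible approach, not the code from the project: `groupByFirstCharacter` is a hypothetical helper, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: groups strings under their upper-cased first character.
function groupByFirstCharacter(array $names): array
{
    $index = [];
    foreach ($names as $name) {
        if ($name === '') {
            continue; // an empty string has no first character to index
        }
        // Take one character (not one byte) and normalise its case.
        $letter = mb_strtoupper(mb_substr($name, 0, 1, 'UTF-8'), 'UTF-8');
        $index[$letter][] = $name;
    }
    return $index;
}

$index = groupByFirstCharacter(['Øen Irland', 'ørken', 'Island of Ireland']);
print_r($index);
```

Both 'Øen Irland' and 'ørken' end up under the key Ø, because `mb_strtoupper` handles the multibyte lower-case ø as well.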

Awesome, we've now got our character. However, the next challenge is to sort the array in alphabetical order. Typically we could do something like:

<?php
$arr = ['B','Z','O','M','E','A'];
sort($arr);
var_dump($arr);
# Returns
# array(6) {
#   [0]=>
#   string(1) "A"
#   [1]=>
#   string(1) "B"
#   [2]=>
#   string(1) "E"
#   [3]=>
#   string(1) "M"
#   [4]=>
#   string(1) "O"
#   [5]=>
#   string(1) "Z"
# }

However, if we try that with special characters such as ø and é (I know, é is not Danish, but go with me) we get:

<?php
$arr = ['B','Ø','Z','O','M','E','A','É'];
sort($arr);
var_dump($arr);
# Returns
# array(8) {
#   [0]=>
#   string(1) "A"
#   [1]=>
#   string(1) "B"
#   [2]=>
#   string(1) "E"
#   [3]=>
#   string(1) "M"
#   [4]=>
#   string(1) "O"
#   [5]=>
#   string(1) "Z"
#   [6]=>
#   string(2) "É"
#   [7]=>
#   string(2) "Ø"
# }
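The reason É and Ø land at the end is that sort() compares these strings byte by byte and knows nothing about alphabets. A quick way to check this, assuming UTF-8 encoding, is to print the bytes of each character:

```php
<?php
// "\u{C9}" is É and "\u{D8}" is Ø (this escape syntax needs PHP 7 or later).
$chars = ['Z', "\u{C9}", "\u{D8}"];
$hex   = array_map('bin2hex', $chars);

foreach ($chars as $i => $char) {
    echo $char, ' => ', $hex[$i], PHP_EOL;
}
// Z => 5a
// É => c389
// Ø => c398
```

Every ASCII letter has a byte value below 0x80, so any two-byte character starting with 0xC3 sorts after Z, and É (c389) sorts before Ø (c398).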

Now, that's just not right! So what do we do now?

Sorting Arrays with UTF-8 Special Characters

For this, we'll need to create a Collator object, which is part of PHP's intl extension. This allows us to define a locale so we can sort an array according to that locale's rules.

<?php
$arr = ['B','Ø','Z','O','M','E','A','É'];
$collator = new Collator('da_DK');
$collator->sort($arr);
var_dump($arr);
# Returns
# array(8) {
#   [0]=>
#   string(1) "A"
#   [1]=>
#   string(1) "B"
#   [2]=>
#   string(1) "E"
#   [3]=>
#   string(2) "É"
#   [4]=>
#   string(1) "M"
#   [5]=>
#   string(1) "O"
#   [6]=>
#   string(1) "Z"
#   [7]=>
#   string(2) "Ø"
# }

That's better. Ø now comes after Z, where the Danish alphabet puts it, and the non-Danish character É is sorted next to E, which is where the Danish locale's rules place it.
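The same approach should carry over to whole words rather than single letters. The sketch below assumes the intl extension may or may not be available, so it falls back to a plain byte-order sort when the Collator class is missing; `sortForLocale` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: locale-aware sort with a byte-order fallback.
function sortForLocale(array $values, string $locale): array
{
    if (class_exists('Collator')) {
        $collator = new Collator($locale);
        $collator->sort($values); // sorts in place using the locale's rules
    } else {
        sort($values); // intl not loaded: plain byte-order sort
    }
    return $values;
}

$sorted = sortForLocale(['Århus', 'Zebra', 'Øen Irland', 'Ærø'], 'da_DK');
print_r($sorted);
```

With intl loaded, I'd expect the order Zebra, Ærø, Øen Irland, Århus, since the Danish alphabet ends in Æ, Ø, Å. Without it, the fallback gives the byte order instead, which is not a correct Danish ordering.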

