Storing UTF-8 Encoded Text with Strings
We talked about strings in Chapter 4, but we’ll look at them in more depth now.
New Rustaceans commonly get stuck on strings for a combination of three reasons:
- Rust’s propensity for exposing possible errors
- strings being a more complicated data structure than many programmers give them credit for,
- and UTF-8.
These factors combine in a way that can seem difficult when you’re coming from other programming languages.
We discuss strings in the context of collections because strings are implemented as a collection of bytes, plus some methods to provide useful functionality when those bytes are interpreted as text. In this section, we’ll talk about the operations on String that every collection type has, such as creating, updating, and reading. We’ll also discuss the ways in which String is different from the other collections, namely how indexing into a String is complicated by the differences between how people and computers interpret String data.
What Is a String?
We’ll first define what we mean by the term string.
Rust has only one string type in the core language, which is the string slice str that is usually seen in its borrowed form &str. In Chapter 4, we talked about string slices, which are references to some UTF-8 encoded string data stored elsewhere. String literals, for example, are stored in the program’s binary and are therefore string slices.
The String type, which is provided by Rust’s standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type. When Rustaceans refer to “strings” in Rust, they might be referring to either the String or the string slice &str types, not just one of those types. Although this section is largely about String, both types are used heavily in Rust’s standard library, and both String and string slices are UTF-8 encoded.
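To make the distinction concrete, here is a minimal sketch of the two types side by side. The helper names byte_len and shout are hypothetical and exist only for this illustration.

```rust
// Hypothetical helper: accepts any borrowed string slice.
fn byte_len(text: &str) -> usize {
    text.len()
}

// Hypothetical helper: builds an owned, growable String from a slice.
fn shout(text: &str) -> String {
    let mut owned = String::from(text); // copies the bytes into a new String
    owned.push('!'); // only the owned type can grow
    owned
}

fn main() {
    let literal: &str = "hello"; // string slice stored in the program's binary
    let owned: String = shout(literal); // owned String
    let view: &str = &owned; // borrow the String as a string slice

    println!("{literal} is {} bytes", byte_len(literal));
    println!("{view} is {} bytes", byte_len(view));
}
```

Taking &str as the parameter type lets the same helper accept both literals and borrowed String values.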
Creating a New String
Many of the same operations available with Vec<T> are available with String as well, because String is actually implemented as a wrapper around a vector of bytes with some extra guarantees, restrictions, and capabilities. An example of a function that works the same way with Vec<T> and String is the new function to create an instance, shown in Listing 8-11.
fn main() {
    let mut s = String::new();
}
Listing 8-11: Creating a new, empty String
This line creates a new empty string called s, which we can then load data into. Often, we’ll have some initial data that we want to start the string with. For that, we use the to_string method, which is available on any type that implements the Display trait, as string literals do. Listing 8-12 shows two examples.
fn main() {
    let data = "initial contents";

    let s = data.to_string();

    // The method also works on a literal directly:
    let s = "initial contents".to_string();
}
Listing 8-12: Using the to_string method to create a
String from a string literal
This code creates a string containing initial contents.
We can also use the function String::from to create a String from a string
literal. The code in Listing 8-13 is equivalent to the code from Listing 8-12
that uses to_string.
fn main() { let s = String::from("initial contents"); }
Listing 8-13: Using the String::from function to create
a String from a string literal
Because strings are used for so many things, we can use many different generic APIs for strings, providing us with a lot of options. Some of them can seem redundant, but they all have their place!
In this case, String::from and to_string do the same thing, so which you choose is a matter of style and readability.
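As a small illustration of those options, the sketch below builds the same String four ways. The function name build_all is hypothetical. The to_owned and into calls are two more standard-library conversions, used here on the assumption that you may meet them in other code.

```rust
// Hypothetical helper: builds the same String through four standard conversions.
fn build_all(text: &str) -> [String; 4] {
    [
        text.to_string(),   // via the Display-based to_string method
        String::from(text), // via the From<&str> implementation
        text.to_owned(),    // via ToOwned
        text.into(),        // via Into<String>, inferred from the array type
    ]
}

fn main() {
    for s in build_all("initial contents") {
        println!("{s}");
    }
}
```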
Remember that strings are UTF-8 encoded, so we can include any properly encoded data in them, as shown in Listing 8-14.
fn main() {
    let hello = String::from("السلام عليكم");
    let hello = String::from("Dobrý den");
    let hello = String::from("Hello");
    let hello = String::from("שָׁלוֹם");
    let hello = String::from("नमस्ते");
    let hello = String::from("こんにちは");
    let hello = String::from("안녕하세요");
    let hello = String::from("你好");
    let hello = String::from("Olá");
    let hello = String::from("Здравствуйте");
    let hello = String::from("Hola");
}
Listing 8-14: Storing greetings in different languages in strings
All of these are valid String values.
Updating a String
A String can grow in size and its contents can change, just like the contents
of a Vec<T>, if you push more data into it. In addition, you can conveniently
use the + operator or the format! macro to concatenate String values.
Appending to a String with push_str and push
We can grow a String by using the push_str method to append a string slice,
as shown in Listing 8-15.
fn main() {
    let mut s = String::from("foo");
    s.push_str("bar");
}
Listing 8-15: Appending a string slice to a String
using the push_str method
After these two lines, s will contain foobar. The push_str method takes a
string slice because we don’t necessarily want to take ownership of the
parameter. For example, in the code in Listing 8-16, we want to be able to use
s2 after appending its contents to s1.
fn main() {
    let mut s1 = String::from("foo");
    let s2 = "bar";
    s1.push_str(s2);
    println!("s2 is {s2}");
}
Listing 8-16: Using a string slice after appending its
contents to a String
If the push_str method took ownership of s2, we wouldn’t be able to print
its value on the last line. However, this code works as we’d expect!
The push method takes a single character as a parameter and adds it to the
String. Listing 8-17 adds the letter “l” to a String using the push
method.
fn main() {
    let mut s = String::from("lo");
    s.push('l');
}
Listing 8-17: Adding one character to a String value
using push
As a result, s will contain lol.
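The two methods are often used together when building a string piece by piece. The following is a minimal sketch under that assumption. The helper join_words is hypothetical and not part of the standard library.

```rust
// Hypothetical helper: joins words with ", " using push and push_str.
fn join_words(words: &[&str]) -> String {
    let mut out = String::new();
    for (i, word) in words.iter().enumerate() {
        if i > 0 {
            out.push(','); // push appends a single char
            out.push(' ');
        }
        out.push_str(word); // push_str appends a borrowed string slice
    }
    out
}

fn main() {
    let joined = join_words(&["foo", "bar", "baz"]);
    println!("{joined}");
}
```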
Concatenation with the + Operator or the format! Macro
Often, you’ll want to combine two existing strings. One way to do so is to use
the + operator, as shown in Listing 8-18.
fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");
    let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used
}
Listing 8-18: Using the + operator to combine two
String values into a new String value
The string s3 will contain Hello, world!. The reason s1 is no longer valid after the addition, and the reason we used a reference to s2, has to do with the signature of the method that’s called when we use the + operator. The + operator uses the add method, whose signature looks something like this:
fn add(self, s: &str) -> String {
In the standard library, you’ll see add defined using generics and associated types. Here, we’ve substituted in concrete types, which is what happens when we call this method with String values. We’ll discuss generics in Chapter 10. This signature gives us the clues we need to understand the tricky bits of the + operator.
First, s2 has an &, meaning that we’re adding a reference of the second string to the first string. This is because of the s parameter in the add function: we can only add a &str to a String; we can’t add two String values together. But wait: the type of &s2 is &String, not &str, as specified in the second parameter to add.
So why does Listing 8-18 compile?
The reason we’re able to use &s2 in the call to add is that the compiler can coerce the &String argument into a &str. When we call the add method, Rust uses a deref coercion, which here turns &s2 into &s2[..]. We’ll discuss deref coercion in more depth in Chapter 15. Because add does not take ownership of the s parameter, s2 will still be a valid String after this operation.
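The same coercion applies to any function that takes a &str parameter, not only add. Here is a minimal sketch. The function describe is a hypothetical stand-in for any such function.

```rust
// Hypothetical function with a &str parameter, like the `s` parameter of add.
fn describe(s: &str) -> String {
    format!("{s} ({} bytes)", s.len())
}

fn main() {
    let owned = String::from("world!");

    let coerced = describe(&owned); // &String is coerced to &str
    let explicit = describe(&owned[..]); // the explicit full-range slice
    let via_method = describe(owned.as_str()); // the explicit method form

    assert_eq!(coerced, explicit);
    assert_eq!(coerced, via_method);

    // `owned` was only borrowed, so it is still usable here.
    println!("{owned}: {coerced}");
}
```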
Second, we can see in the signature that add takes ownership of self,
because self does not have an &.
This means s1 in Listing 8-18 will be
moved into the add call and will no longer be valid after that.
So although
let s3 = s1 + &s2; looks like it will copy both strings and create a new one,
this statement actually takes ownership of s1, appends a copy of the contents
of s2, and then returns ownership of the result.
In other words, it looks like it’s making a lot of copies but isn’t; the implementation is more efficient than copying.
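One way to observe this is to compare the buffer address before and after the addition. The sketch below uses a hypothetical helper, plus_reuses_buffer, and reserves enough capacity up front so that no reallocation is needed. It also shows that cloning the left operand is an option when s1 must stay usable, at the cost of one extra copy.

```rust
// Hypothetical helper: reports whether `+` kept the left operand's buffer.
// Capacity is reserved up front so appending does not need to reallocate.
fn plus_reuses_buffer(left: &str, right: &str) -> bool {
    let mut s1 = String::with_capacity(left.len() + right.len());
    s1.push_str(left);
    let before = s1.as_ptr(); // address of the heap buffer before the move
    let s3 = s1 + right; // s1 is moved into add; right is appended to it
    before == s3.as_ptr()
}

fn main() {
    println!("buffer reused: {}", plus_reuses_buffer("Hello, ", "world!"));

    // If s1 is still needed afterward, clone it explicitly.
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");
    let s3 = s1.clone() + &s2;
    println!("{s1}{s2} / {s3}");
}
```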
If we need to concatenate multiple strings, the behavior of the + operator
gets unwieldy:
fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = s1 + "-" + &s2 + "-" + &s3;
}
At this point, s will be tic-tac-toe. With all of the + and " characters, it’s difficult to see what’s going on.
Using the format! macro
For more complicated string combining, we can instead use the format! macro:
fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = format!("{s1}-{s2}-{s3}");
}
This code also sets s to tic-tac-toe. The format! macro works like println!, but instead of printing the output to the screen, it returns a String with the contents. The version of the code using format! is much easier to read, and the code generated by the format! macro uses references so that this call doesn’t take ownership of any of its parameters.
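A short sketch can confirm that the inputs survive the call. The function join_with_dashes is hypothetical. It hands the three original strings back to the caller after formatting, which would not compile if format! had moved them.

```rust
// Hypothetical helper: formats three strings, then returns the originals too.
fn join_with_dashes(s1: String, s2: String, s3: String) -> (String, [String; 3]) {
    let joined = format!("{s1}-{s2}-{s3}"); // borrows s1, s2, and s3
    (joined, [s1, s2, s3]) // all three are still owned here
}

fn main() {
    let (joined, parts) = join_with_dashes(
        String::from("tic"),
        String::from("tac"),
        String::from("toe"),
    );
    println!("{joined} was built from {parts:?}");
}
```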
Indexing into Strings
In many other programming languages, accessing individual characters in a
string by referencing them by index is a valid and common operation. However,
if you try to access parts of a String using indexing syntax in Rust, you’ll
get an error.
Consider the invalid code in Listing 8-19.
fn main() {
    let s1 = String::from("hello");
    let h = s1[0];
}
Listing 8-19: Attempting to use indexing syntax with a String
This code will result in the following error:
$ cargo run
Compiling collections v0.1.0 (file:///projects/collections)
error[E0277]: the type `String` cannot be indexed by `{integer}`
--> src/main.rs:3:13
|
3 | let h = s1[0];
| ^^^^^ `String` cannot be indexed by `{integer}`
|
= help: the trait `Index<{integer}>` is not implemented for `String`
= help: the following other types implement trait `Index<Idx>`:
<String as Index<RangeFrom<usize>>>
<String as Index<RangeFull>>
<String as Index<RangeInclusive<usize>>>
<String as Index<RangeTo<usize>>>
<String as Index<RangeToInclusive<usize>>>
<String as Index<std::ops::Range<usize>>>
<str as Index<I>>
For more information about this error, try `rustc --explain E0277`.
error: could not compile `collections` due to previous error
The error and the help text tell the story: Rust strings don’t support indexing with a single integer.
But why not? To answer that question, we need to discuss how Rust stores strings in memory.
Internal Representation
A String is a wrapper over a Vec<u8>.
Let’s look at some of our properly encoded UTF-8 example strings from Listing 8-14.
First, this one:
let hello = String::from("Hola");
In this case, len will be 4, which means the vector storing the string “Hola” is 4 bytes long. Each of these letters takes 1 byte when encoded in UTF-8. The following line, however, may surprise you. (Note that this string begins with the capital Cyrillic letter Ze, not the Arabic number 3.)
let hello = String::from("Здравствуйте");
Asked how long the string is, you might say 12. In fact, Rust’s answer is 24: that’s the number of bytes it takes to encode “Здравствуйте” in UTF-8, because each Unicode scalar value in that string takes 2 bytes of storage. Therefore, an index into the string’s bytes will not always correlate to a valid Unicode scalar value.
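The difference between the two counts can be checked directly. In this small sketch, the helper name byte_and_char_counts is hypothetical. The len and chars methods are standard.

```rust
// Hypothetical helper: returns (length in bytes, number of Unicode scalar values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    for word in ["Hola", "Здравствуйте"] {
        let (bytes, chars) = byte_and_char_counts(word);
        println!("{word}: {bytes} bytes, {chars} chars");
    }
}
```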
To demonstrate, consider this invalid Rust code:
let hello = "Здравствуйте";
let answer = &hello[0];
You already know that answer will not be З, the first letter. When encoded in UTF-8, the first byte of З is 208 and the second is 151, so it would seem that answer should in fact be 208, but 208 is not a valid character on its own. Returning 208 is likely not what a user would want if they asked for the first letter of this string; however, that’s the only data that Rust has at byte index 0. Users generally don’t want the byte value returned, even if the string contains only Latin letters: if &"hello"[0] were valid code that returned the byte value, it would return 104, not h.
The answer, then, is that to avoid returning an unexpected value and causing bugs that might not be discovered immediately, Rust doesn’t compile this code at all and prevents misunderstandings early in the development process.
Bytes and Scalar Values and Grapheme Clusters! Oh My!
Another point about UTF-8 is that there are actually three relevant ways to look at strings from Rust’s perspective:
- as bytes,
- scalar values,
- and grapheme clusters (the closest thing to what we would call letters).
If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is
stored as a vector of u8 values that looks like this:
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]
That’s 18 bytes and is how computers ultimately store this data. If we look at
them as Unicode scalar values, which are what Rust’s char type is, those
bytes look like this:
['न', 'म', 'स', '्', 'त', 'े']
There are six char values here, but the fourth and sixth are not letters:
they’re diacritics that don’t make sense on their own. Finally, if we look at
them as grapheme clusters, we’d get what a person would call the four letters
that make up the Hindi word:
["न", "म", "स्", "ते"]
Rust provides different ways of interpreting the raw string data that computers store so that each program can choose the interpretation it needs, no matter what human language the data is in.
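The first two views are available from the standard library. The sketch below reproduces them for the same word. The helper names byte_view and char_view are hypothetical, and the word is written with Unicode escapes so that the exact scalar values are unambiguous. Grapheme clusters are omitted because the standard library does not provide them.

```rust
// Hypothetical helper: the string as raw UTF-8 bytes.
fn byte_view(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

// Hypothetical helper: the string as Unicode scalar values.
fn char_view(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // The Hindi word from the text, spelled out as six scalar values.
    let word = "\u{0928}\u{092E}\u{0938}\u{094D}\u{0924}\u{0947}";

    println!("{} bytes: {:?}", byte_view(word).len(), byte_view(word));
    println!("{} chars: {:?}", char_view(word).len(), char_view(word));
}
```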
A final reason Rust doesn’t allow us to index into a String to get a
character is that indexing operations are expected to always take constant time
(O(1)). But it isn’t possible to guarantee that performance with a String,
because Rust would have to walk through the contents from the beginning to the
index to determine how many valid characters there were.
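If you do need the character at a given position, you can ask for it explicitly through the chars iterator, which makes the linear walk visible in the code. This is a minimal sketch. The helper nth_char is hypothetical.

```rust
// Hypothetical helper: returns the n-th Unicode scalar value, if there is one.
// The iterator decodes the string from the start, so this is O(n), not O(1).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", nth_char(hello, 0));
    println!("{:?}", nth_char(hello, 100)); // past the end: no panic, just None
}
```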
Slicing Strings
Indexing into a string is often a bad idea because it’s not clear what the return type of the string-indexing operation should be:
a byte value, a character, a grapheme cluster, or a string slice.
If you really need to use indices to create string slices, therefore, Rust asks you to be more specific.
Rather than indexing using [] with a single number, you can use [] with a
range to create a string slice containing particular bytes:
let hello = "Здравствуйте";
let s = &hello[0..4];
Here, s will be a &str that contains the first 4 bytes of the string.
Earlier, we mentioned that each of these characters was 2 bytes, which means
s will be Зд.
If we were to try to slice only part of a character’s bytes with something like
&hello[0..1], Rust would panic at runtime in the same way as if an invalid
index were accessed in a vector:
$ cargo run
Compiling collections v0.1.0 (file:///projects/collections)
Finished dev [unoptimized + debuginfo] target(s) in 0.43s
Running `target/debug/collections`
thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'З' (bytes 0..2) of `Здравствуйте`', library/core/src/str/mod.rs:127:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
You should use ranges to create string slices with caution, because doing so can crash your program.
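When the byte offsets are not known to be valid, one option is the get method on str, which returns an Option instead of panicking, along with is_char_boundary for checking an offset first. The wrapper checked_slice below is a hypothetical helper for illustration.

```rust
// Hypothetical helper: slices by byte range and returns None instead of
// panicking when the range is out of bounds or splits a character.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";

    println!("{:?}", checked_slice(hello, 0, 4)); // two whole characters
    println!("{:?}", checked_slice(hello, 0, 1)); // splits the first character

    // is_char_boundary reports whether a byte offset is safe to slice at.
    println!("{}", hello.is_char_boundary(1));
    println!("{}", hello.is_char_boundary(2));
}
```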
Methods for Iterating Over Strings
The best way to operate on pieces of strings is to be explicit about whether you want characters or bytes.
For individual Unicode scalar values, use the chars method. Calling chars on “Зд” separates out and returns two values of type char, and you can iterate over the result to access each element:
for c in "Зд".chars() {
    println!("{c}");
}
This code will print the following:
З
д
Alternatively, the bytes method returns each raw byte, which might be
appropriate for your domain:
for b in "Зд".bytes() {
    println!("{b}");
}
This code will print the four bytes that make up this string:
208
151
208
180
But be sure to remember that valid Unicode scalar values may be made up of more than 1 byte.
Getting grapheme clusters from strings as with the Devanagari script is complex, so this functionality is not provided by the standard library. Crates are available on crates.io if this is the functionality you need.
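For scalar values, the standard library’s char_indices method pairs each char with its starting byte offset, which gives a way to slice on character boundaries without panicking. The helper first_n_chars is a hypothetical sketch of that idea. It counts Unicode scalar values, not grapheme clusters.

```rust
// Hypothetical helper: returns the prefix holding the first n scalar values.
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // byte_idx is where the (n + 1)-th char starts, so it is a valid boundary.
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than n + 1 chars: the whole string is the prefix.
        None => s,
    }
}

fn main() {
    println!("{}", first_n_chars("Здравствуйте", 2));
    println!("{}", first_n_chars("hi", 5));
}
```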
Strings Are Not So Simple
To summarize, strings are complicated.
Different programming languages make different choices about how to present this complexity to the programmer. Rust has chosen to make the correct handling of String data the default behavior for all Rust programs, which means programmers have to put more thought into handling UTF-8 data upfront. This trade-off exposes more of the complexity of strings than is apparent in other programming languages, but it prevents you from having to handle errors involving non-ASCII characters later in your development life cycle.
The good news is that the standard library offers a lot of functionality built off the String and &str types to help handle these complex situations correctly.
Be sure to check out the documentation for useful methods like
contains for searching in a string and replace for substituting parts of a
string with another string.
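As a brief illustration of those two methods, the sketch below combines them. The helper mask_word and the "***" placeholder are hypothetical choices made for this example.

```rust
// Hypothetical helper: masks every occurrence of `word` if the text contains it.
fn mask_word(text: &str, word: &str) -> String {
    if text.contains(word) {
        text.replace(word, "***") // replace returns a new String
    } else {
        text.to_string()
    }
}

fn main() {
    println!("{}", mask_word("tic-tac-toe", "tac"));
    println!("{}", mask_word("Здравствуйте", "вств"));
}
```

Both methods match on string patterns, so they work unchanged on non-ASCII text.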
Let’s switch to something a bit less complex: hash maps!