Strings

Ultimate Rust Crash Course Primitive Types and Control Flow
4 minutes

More than you ever wanted to know about Strings in Rust, so you can stop running into walls and get on with your life.

In this video:

  • Strings and borrowed string slices
  • How a string is implemented, high-level
  • UTF-8
  • Bytes
  • Unicode scalars
  • Graphemes
  • Iterators (a little bit) and .nth()

Transcript

Strings. I'm going to warn you up front: here be dragons. I'll do my best to steer you right. There are at least six types of strings in the Rust standard library, but we mostly care about two of them, which overlap each other. The first is called a string slice (str), and you will almost always see it as a borrowed string slice (&str). We'll talk more about borrowing later.

A literal string is always a borrowed string slice. A borrowed string slice is often referred to as a "string", which can be really confusing when you learn that the other string type is String, with a capital S. The biggest difference between the two is that the data in a borrowed string slice cannot be modified, while the data in a String can be modified. You will often create a String by calling the to_string() method on a borrowed string slice, or by passing a borrowed string slice to String::from. A borrowed string slice is internally made up of a pointer to some bytes and a length. A String is made up of a pointer to some bytes, a length, and a capacity that may be higher than what is currently being used. In other words, a borrowed string slice is a subset of a String in more ways than one, which is why they share a bunch of other characteristics.
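As a minimal sketch of the two ways of creating a String described above: the helper name make_strings is hypothetical and exists only for this illustration, and the capacity value 16 is an arbitrary choice.

```rust
// Hypothetical helper for illustration: builds an owned String two ways.
fn make_strings() -> (String, String) {
    let literal: &str = "hello"; // a string literal is a borrowed string slice
    let a = literal.to_string(); // via the to_string() method
    let b = String::from(literal); // via String::from
    (a, b)
}

fn main() {
    let (a, b) = make_strings();
    println!("{} {}", a, b);

    // A String tracks a length and a capacity; the capacity may be larger
    // than what is currently in use.
    let mut s = String::with_capacity(16);
    s.push_str("hello");
    println!("len = {}, capacity at least 16: {}", s.len(), s.capacity() >= 16);

    // Borrowing part of a String yields a &str (a pointer plus a length).
    let slice: &str = &s[0..2];
    println!("{}", slice);
}
```

Both conversions produce equal Strings; which one to use is largely a matter of taste.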

For example, both string types are valid UTF-8, by definition, by compiler enforcement, and by runtime checks. Also, strings cannot be indexed by character position. Why not? Because English is not the only language in the world. In fact, Google told me that there are over 6,900 living languages, and emojis on top of that, and they all seem to make their way into Unicode. And strings are Unicode, which means things get complicated.
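One way to see the runtime check in action is String::from_utf8, which refuses to build a String from bytes that are not valid UTF-8. The helper name is_valid_utf8 is hypothetical; this is only a sketch.

```rust
// Hypothetical helper for illustration: reports whether the bytes
// would be accepted as the contents of a String.
fn is_valid_utf8(bytes: Vec<u8>) -> bool {
    String::from_utf8(bytes).is_ok()
}

fn main() {
    // 0x68 0x69 is the ASCII (and therefore UTF-8) encoding of "hi".
    println!("{}", is_valid_utf8(vec![0x68, 0x69]));

    // The byte 0xff never appears in valid UTF-8, so this is rejected.
    println!("{}", is_valid_utf8(vec![0xff, 0xfe]));

    // Indexing by position is rejected at compile time. Uncommenting the
    // next two lines makes the program fail to compile:
    // let s = "hi";
    // let c = s[0];
}
```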

Let's take a look at the Thai word sawatdee (สวัสดี). Let's say that we wanted to get this thing, what we think should be index three. Ultimately, this string is stored as a vector of 18 bytes. Would we get what we wanted if we indexed in by bytes? Not even close. Unicode scalars in UTF-8 can be represented by one, two, three, or four bytes, and you have to traverse the bytes in order to tell where one scalar ends and the next begins. In this case, every three bytes is a Unicode scalar. So if there were a way to index into the scalars, would we get what we want? Closer, but still off. Diacritics are Unicode scalars that combine with other Unicode scalars to produce a different grapheme, and the grapheme is usually what we care about. So now you understand that graphemes decompose into variable amounts of scalars, which decompose into variable amounts of bytes. As part of Rust's emphasis on speed, indexing operations on standard library collections are always guaranteed to be constant-time operations.
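The byte and scalar counts can be checked with a short sketch. It assumes the 18-byte Thai word is สวัสดี, written here with Unicode escapes so the scalars are explicit; the helper name byte_and_scalar_counts is hypothetical.

```rust
// Hypothetical helper for illustration: returns (number of bytes,
// number of Unicode scalars) for a string slice.
fn byte_and_scalar_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Assumed spelling of the word from the transcript: สวัสดี
    let word = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

    let (bytes, scalars) = byte_and_scalar_counts(word);
    println!("{} bytes, {} scalars", bytes, scalars);

    // Each scalar in this particular word needs three bytes in UTF-8.
    for c in word.chars() {
        println!("U+{:04X} takes {} bytes", c as u32, c.len_utf8());
    }
}
```

Two of the six scalars (U+0E31 and U+0E35) are combining marks, which is why the number of graphemes a reader sees is smaller than the number of scalars.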

You can't do that with strings, because the bytes, which are indexable, aren't guaranteed to be what people want when they index into a string. And the graphemes, which people do want, can only be retrieved after slowly examining a sequence of bytes. So when presented with a string, you have some options. You can use the bytes method to access the vector of UTF-8 bytes, which you can index into if you want. Since bytes are fixed size, this actually works fine for simple English text, as long as you stick to the portion that overlaps ASCII. You can use the chars method to retrieve an iterator that you can use to iterate through the Unicode scalars. And finally, you can use a package like unicode-segmentation, which provides handy functions that return iterators that handle graphemes of various types.
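A rough sketch of the first two options follows. Note that bytes() itself returns an iterator; for an indexable view this sketch assumes as_bytes(), which returns a byte slice. The helper names byte_at and scalars are hypothetical. The grapheme option needs the external unicode-segmentation package, so it is only mentioned in a comment here.

```rust
// Hypothetical helper for illustration: constant-time lookup of one byte.
fn byte_at(s: &str, i: usize) -> Option<u8> {
    s.as_bytes().get(i).copied()
}

// Hypothetical helper for illustration: walks the string and collects
// its Unicode scalars.
fn scalars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // Option 1: bytes. Fine for text that stays within ASCII.
    let ascii = "hello";
    println!("{:?}", byte_at(ascii, 1).map(|b| b as char));

    // Option 2: Unicode scalars, via the chars() iterator.
    let word = "h\u{e9}llo"; // the second scalar takes two bytes
    println!("{:?}", scalars(word));

    // Option 3 (not shown): graphemes, through an external package such as
    // unicode-segmentation, which has to be added as a dependency first.
}
```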

With each of these approaches, you know that if you can index into something, it will be a fast, constant-time operation, while if you iterate through something, it is going to process some variable number of bytes during each iteration of the loop. Hopefully you can sidestep most of these issues by using one of the many helper methods created to manipulate strings. But if you do end up manually using one of the iterators, iterators have a handy method called nth that you can use in place of indexing. And now you know why you have to pick an iterator and use nth instead of being able to index into a string directly. In the next video, we will talk about ownership.
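For instance, picking the chars iterator and calling nth might look like the sketch below. The helper name nth_scalar is hypothetical, and the Thai word is assumed to be สวัสดี, written with Unicode escapes.

```rust
// Hypothetical helper for illustration: returns the scalar at position n,
// or None if the string has fewer than n + 1 scalars.
fn nth_scalar(s: &str, n: usize) -> Option<char> {
    // nth() walks the iterator, so this is not a constant-time operation.
    s.chars().nth(n)
}

fn main() {
    // Assumed spelling of the word from the transcript: สวัสดี
    let word = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

    // Scalar number three (counting from zero) is U+0E2A.
    println!("{:?}", nth_scalar(word, 3));

    // Asking past the end yields None instead of panicking.
    println!("{:?}", nth_scalar("abc", 10));
}
```

Because nth returns an Option, the out-of-range case is handled explicitly rather than by a panic, as it would be with slice indexing.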
