Cutting and slicing strings in Python

Python strings as sequences of characters

Python strings are sequences of individual characters and share their basic methods of access with those other Python sequences – lists and tuples. The simplest way of extracting single characters from strings (and individual members from any sequence) is to unpack them into corresponding variables.

>>> s = 'Don'
>>> s
'Don'
>>> a, b, c = s # Unpack into variables
>>> a
'D'
>>> b
'o'
>>> c
'n'

Unfortunately, it’s not often that we have the luxury of knowing in advance how many variables we are going to need in order to store every character in the string. And if the number of variables we supply doesn’t match the number of characters in the string, Python will give us an error.

>>> s = 'Don Quijote'
>>> a, b, c = s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 3)
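
If the length of the string isn’t known in advance, one possible way around the error is Python 3’s starred unpacking, which gathers any leftover characters into a list. This is only a sketch; the names first, rest, middle and last are made up for illustration.

```python
s = 'Don Quijote'

# A starred name soaks up however many characters are left over,
# so the unpacking no longer depends on the exact length of s.
first, *rest = s
print(first)          # D
print(''.join(rest))  # on Quijote

# The star can also sit in the middle.
first, *middle, last = s
print(first, last)    # D e
```

Note that the starred name always receives a list, even when it ends up empty.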

Accessing characters in strings by index in Python

Typically it’s more useful to access the individual characters of a string by using Python’s array-like indexing syntax. Here, as with all sequences, it’s important to remember that indexing is zero-based; that is, the first item in the sequence is number 0.

>>> s = 'Don Quijote'
>>> s[4] # Get the 5th character
'Q'

If you want to start counting from the end of the string, instead of the beginning, use a negative index. For example, an index of -1 refers to the right-most character of the string.

>>> s[-1]
'e'
>>> s[-7]
'Q'
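
One way to picture negative indices is that -k points at the same character as len(s) - k. The short check below assumes the same example string as above.

```python
s = 'Don Quijote'

# A negative index -k refers to the same character as len(s) - k.
for k in (1, 7):
    assert s[-k] == s[len(s) - k]

print(s[-1], s[len(s) - 1])  # e e
print(s[-7], s[len(s) - 7])  # Q Q

# Going past either end raises IndexError instead of wrapping around.
try:
    s[len(s)]
except IndexError as err:
    print('IndexError:', err)  # IndexError: string index out of range
```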

Python strings are immutable, which is just a fancy way of saying that once they’ve been created, you can’t change them. Attempting to do so triggers an error.

>>> s[7]
'j'
>>> s[7] = 'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

If you want to modify a string, you have to create it as a totally new string. In practice, it’s easy. We’ll look at how in a minute.

Slicing Python strings

Before that, what if you want to extract a chunk of more than one character, with known position and size? That’s fairly easy and intuitive. We extend the square-bracket syntax a little, so that we can specify not only the starting position of the piece we want, but also where it ends.

>>> s[4:8]
'Quij'

Let’s look at what’s happening here. Just as before, we’re specifying that we want to start at position 4 (zero-based) in the string. But now, instead of contenting ourselves with a single character from the string, we’re saying that we want more characters, up to but not including the character at position 8.

You might have thought that you were going to get the character at position 8 too. But that’s not the way it works. Don’t worry – you’ll get used to it. If it helps, think of the second index (the one after the colon) as specifying the first character that you don’t want. Incidentally, a benefit of this mechanism is that you can quickly tell how many characters you are going to end up with simply by subtracting the first index from the second.
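
The subtraction rule is easy to check for yourself. The names start, end and chunk below are illustrative only.

```python
s = 'Don Quijote'
start, end = 4, 8

chunk = s[start:end]
print(chunk)                      # Quij
print(len(chunk) == end - start)  # True

# The rule only holds while both indices lie inside the string:
# a slice that runs past the end is quietly cut short, not an error.
print(s[8:100])                   # ote
```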

Using this syntax, you can omit either or both of the indices. The first index, if omitted, defaults to 0, so that your chunk starts from the beginning of the original string; the second defaults to the length of the string (one past the highest position), so that your chunk runs to the end of the original string. Omitting both indices isn’t likely to be of much practical use; as you might guess, it simply returns the whole of the original string.

>>> s[4:] # Returns from pos 4 to the end of the string
'Quijote'
>>> s[:4] # Returns from the beginning to pos 3
'Don '
>>> s[:]
'Don Quijote'
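
If you’re curious which values Python fills in for a missing index, the built-in slice object can tell you. This is just a small sketch using slice.indices(), which reports the start, stop and step that would be applied to a sequence of a given length.

```python
s = 'Don Quijote'

# slice.indices() reports the (start, stop, step) Python will actually
# use for a sequence of the given length.
print(slice(4, None).indices(len(s)))     # (4, 11, 1)
print(slice(None, 4).indices(len(s)))     # (0, 4, 1)
print(slice(None, None).indices(len(s)))  # (0, 11, 1)

# So these pairs are equivalent.
assert s[4:] == s[4:len(s)]
assert s[:4] == s[0:4]
```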

If you’re still struggling to get your head around the fact that, for example, s[0:8] returns everything up to, but not including, the character at position 8, it may help if you roll this around in your head a bit: for any value of index, n, that you choose, the value of s[:n] + s[n:] will always be the same as the original target string. If the indexing mechanism were inclusive, the character at position n would appear twice.

>>> s[6]
'i'
>>> s[:6] + s[6:]
'Don Quijote'
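
The same idea gives us one way of building a modified copy of a string: take the pieces on either side of a position and join them around something new. The sketch below assumes we want to swap the character at position 7; the names n and new_s are illustrative.

```python
s = 'Don Quijote'

# s[:n] + s[n:] rebuilds the original for every n, including
# values beyond the ends of the string.
for n in range(-15, 16):
    assert s[:n] + s[n:] == s

# Strings are immutable, but we can assemble a new one from the pieces
# on either side of the character we want to change.
n = 7
new_s = s[:n] + 'x' + s[n + 1:]
print(new_s)  # Don Quixote
print(s)      # Don Quijote
```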

Just as before, you can use negative numbers as indices, in which case the counting starts at the end of the string (with an index of -1) instead of at the beginning.

>>> s[-7:-3]
'Quij'

Skipping characters while slicing Python strings

The final variation on the square-bracket syntax is to add a third parameter, which specifies the ‘stride’, or how many characters you want to move forward after each character is retrieved from the original string. The first retrieved character always corresponds to the index before the colon; but thereafter, the pointer moves forward however many characters you specify as your stride, and retrieves the character at that position. And so on, until the ending index is reached or exceeded. If, as in the cases we’ve met so far, the parameter is omitted, it defaults to 1, so that every character in the specified segment is retrieved. An example makes this clearer.

>>> s[4:8]
'Quij'
>>> s[4:8:1] # 1 is the default value anyway, so same result
'Quij'
>>> s[4:8:2] # Return a character, then move forward 2 positions, etc.
'Qi' # Quite interesting!

You can specify a negative stride too. As you might expect, this indicates that you want Python to go backwards when retrieving characters.

>>> s[8:4:-1]
'ojiu'

As you can see, since we’re going backwards, it makes sense for the starting index to be higher than the ending index (otherwise nothing would be returned).

>>> s[4:8:-1]
''

For that reason, if you specify a negative stride, but omit either the first or second index, Python defaults the missing value to whatever makes sense in the circumstances: the start index to the end of the string, and the end index to the beginning of the string. I know, it can make your head ache thinking about it, but Python knows what it’s doing.

>>> s[4::-1] # End index defaults to the beginning of the string
'Q noD'
>>> s[:4:-1] # Beginning index defaults to the end of the string
'etojiu'
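
If the three-part syntax still feels slippery, it may help to see that, for indices that lie inside the string, a slice visits the same positions as range() with the same three arguments. The helper slice_by_hand() below is hypothetical, written only to make that point, and doesn’t handle omitted or negative indices.

```python
s = 'Don Quijote'

def slice_by_hand(text, start, stop, step):
    """Hypothetical helper: rebuild text[start:stop:step] one index at a time."""
    return ''.join(text[i] for i in range(start, stop, step))

print(slice_by_hand(s, 4, 8, 2))   # Qi
print(slice_by_hand(s, 8, 4, -1))  # ojiu

# With a negative stride and both indices omitted, the whole string is reversed.
print(s[::-1])                     # etojiuQ noD
```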

So that’s the square-bracket syntax, which allows you to retrieve chunks of characters if you know the exact position of your required chunk in the string.

But what if you want to retrieve a chunk based on the contents of the string, which you may not know in advance?

Examining the contents

Python provides string methods that allow us to chop a string up according to delimiters that we specify. In other words, we can tell Python to look for a certain substring within our target string, and split the target string up around that substring. It does that by returning a list of the resulting substrings (minus the delimiters). By the way, we can choose not to specify a delimiter explicitly, in which case the string is split on any run of whitespace characters (space, ‘\t’, ‘\n’, ‘\r’, ‘\f’).

Remember that these methods have no effect on the string on which you invoke them; they simply return a new object (here, a list of strings).

>>> s.split()
['Don', 'Quijote']
>>> s # s has not been changed
'Don Quijote'
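
The difference between leaving the delimiter out and passing a single space shows up as soon as the whitespace gets untidy. The string messy below is an invented example.

```python
messy = '  Don \t Quijote\n'

# With no argument, runs of whitespace count as a single delimiter and
# leading or trailing whitespace is ignored.
print(messy.split())     # ['Don', 'Quijote']

# With an explicit delimiter, every occurrence splits, so empty strings
# appear wherever two delimiters are adjacent or at the ends.
print(messy.split(' '))  # ['', '', 'Don', '\t', 'Quijote\n']
```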

More usefully, we can unpack the returned list directly into appropriate variables.

>>> title, handle = s.split()
>>> title
'Don'
>>> handle
'Quijote'

Leaving our Spanish hero to his windmills for a moment, let’s imagine that we have a string containing a clock time in hours, minutes and seconds, delimited by colons. In this case, we might reasonably gather up the separate parts into variables for further manipulation.

>>> tim = '16:30:10'
>>> hrs, mins, secs = tim.split(':')
>>> hrs
'16'
>>> mins
'30'
>>> secs
'10'
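
Notice that the parts come back as strings, not numbers. For the further manipulation mentioned above, one option is to convert them as they are unpacked; total_seconds is just an illustrative name.

```python
tim = '16:30:10'

# split() always returns strings, so convert before doing arithmetic.
hrs, mins, secs = (int(part) for part in tim.split(':'))

total_seconds = hrs * 3600 + mins * 60 + secs
print(total_seconds)  # 59410
```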

We might only want to split the target string once, no matter how many times the delimiter occurs. The split() method will accept a second parameter that specifies the maximum number of splits to perform.

>>> tim.split(':', 1) # split() only once
['16', '30:10']

Here, the string is split on the first colon, and the remainder is left untouched. And if we want Python to start looking for delimiters from the other end of the string? Well, there is a variant method called rsplit(), which does just that.

>>> tim.rsplit(':', 1)
['16:30', '10']
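
A common use for the pair is peeling one piece off either end of a dotted or delimited name. The file name below is a made-up example value.

```python
filename = 'archive.tar.gz'   # hypothetical example value

print(filename.split('.', 1))   # ['archive', 'tar.gz']
print(filename.rsplit('.', 1))  # ['archive.tar', 'gz']

# Without a limit, the two methods give the same list.
assert filename.split('.') == filename.rsplit('.')

# If the delimiter is absent, the result is a one-item list, so
# unpacking into two names would raise ValueError.
print('README'.split('.', 1))   # ['README']
```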

Building a partition

A similar string method is partition(). This also splits up a string based on content, the differences being that the result is a tuple, and that it preserves the delimiter along with the two parts of the target string on either side of it. Unlike split(), partition() always performs exactly one split, no matter how many times the delimiter appears in the target string.

>>> tim = '16:30:10'
>>> tim.partition(':')
('16', ':', '30:10')

As with the split() method, there is a variant of partition(), called rpartition(), that begins its search for delimiters from the other end of the target string.

>>> tim.rpartition(':')
('16:30', ':', '10')
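
Because the result always has exactly three items, it unpacks safely even when the delimiter is nowhere to be found. The sketch below shows what comes back in that case; the names hrs, sep and rest are illustrative.

```python
tim = '16:30:10'

# partition() always returns exactly three items, so unpacking is safe.
hrs, sep, rest = tim.partition(':')
print(hrs, rest)                # 16 30:10

# When the delimiter is missing, the separator slot is empty.
print('1630'.partition(':'))    # ('1630', '', '')
print('1630'.rpartition(':'))   # ('', '', '1630')
```

An empty middle item is therefore a handy test for whether the delimiter was present at all.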

Using Python’s str.replace()

Now, back to our Don Quijote. Earlier on, when we tried to anglicise his name by changing the ‘j’ to an ‘x’ by assigning the ‘x’ directly to s[7], we found that we couldn’t do it, because you can’t change existing Python strings. But we can get around this by creating a new string that’s more to our liking, based on the old string. The string method that allows us to do this is replace().

>>> s.replace('j', 'x')
'Don Quixote'
>>> s # s has not been changed
'Don Quijote'

Again, our string hasn’t been changed at all. What has happened is that Python simply returned a new string according to the instructions we gave, then immediately discarded it, leaving our original string unaltered. To preserve our new string, we need to assign it to a variable.

>>> new_s = s.replace('j', 'x')
>>> s
'Don Quijote'
>>> new_s
'Don Quixote'

But of course, instead of introducing a new variable, we can just reuse the existing one.

>>> s = s.replace('j', 'x')
>>> s
'Don Quixote'

And here, although it may seem that we’ve changed the original string, in actual fact we’ve just discarded it and stored a new string in its place.

Note that, by default, replace() will replace every occurrence of the search substring with the new substring.

>>> s = 'Don Quijote'
>>> s.replace('o', 'a')
'Dan Quijate'

We can control this extravagance by adding an extra parameter specifying the maximum number of times that the search substring should be replaced.

>>> s.replace('o', 'a', 1)
'Dan Quijote'

Finally, the replace() method is not limited to acting on single characters. We can replace a whole chunk of the target string with some specified value.

>>> s.replace(' Qui', 'key ')
'Donkey jote'
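
A few further behaviours of replace() are worth trying out for yourself. The replacement values below are arbitrary and chosen only for illustration.

```python
s = 'Don Quijote'

# replace() is case-sensitive: 'd' does not match the capital 'D',
# so the string comes back unchanged.
print(s.replace('d', 'T'))                        # Don Quijote

# Each call returns a new string, so calls can be chained.
print(s.replace('j', 'x').replace('Don', 'Sir'))  # Sir Quixote

# A count of 0 replaces nothing.
print(s.replace('o', 'a', 0))                     # Don Quijote
```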

 
