Methods

2. Basic Types: Strings

Strings often appear in data analysis, where they are typically read from external sources and need to be processed before use. To this end, the $\color{#4271ae} {\mathtt{\text{str}}}$ type provides a multitude of methods.

We can find the number of characters used within a string, in other words: its length, by using the $\mathtt{\color{#4271ae}{\text{len}}\text{()}}$ function. Note that punctuation and even whitespace also count as characters.

>>> len("Hello, world!")

13

We can also obtain the number of (non-overlapping) occurrences of a character or substring within a string using the $\mathtt{\color{#4271ae}{\text{count}}\text{()}}$ method.

>>> "Hello, world!".count("o")

2

>>> "Hello, world!".count("world")

1

We can check whether a substring occurs in a string using the $\color{#8959A8} {\mathtt{\text{in}}}$ keyword. If we also want to know the position of the first occurrence, we can use $\mathtt{\color{#4271ae} {\text{find}}\text{()}}$ , which returns $-1$ if the substring does not occur.

>>> "world" in "Hello, world!"

True

>>> "Hello, world!".find("world")

7
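The three checks above can be combined into one small helper. This is only a minimal sketch: the name `describe_substring` is hypothetical and not part of Python, and it relies on find() returning -1 when the substring is absent.

```python
def describe_substring(text, sub):
    """Summarize how `sub` occurs in `text`.

    `describe_substring` is an illustrative helper, not a built-in.
    """
    return {
        "found": sub in text,             # membership test
        "first_index": text.find(sub),    # first position, or -1 if absent
        "occurrences": text.count(sub),   # non-overlapping occurrences
    }


print(describe_substring("Hello, world!", "o"))
print(describe_substring("Hello, world!", "xyz"))
```

Checking the result of find() against -1 is a design choice here; testing with the in keyword first reads more clearly when the position itself is not needed.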

Indexing

The $\color{#4271ae} {\mathtt{\text{str}}}$ type can be indexed: we can request the character at a certain position. Indices always start at $0$ . We index a string by writing the position of the desired character within square brackets after a string or variable.

>>> "Hello, world!"[0]

'H'

>>> "Hello, world!"[1]

'e'

In this case, the last character is at position $\mathtt{\color{#4271ae} {\text{len}}\text{(}\color{#4271ae}{\text{str}}\text{)}}-1=12$ . However, we might want to obtain the last character without knowing the length of the string in advance. In that case, we can use negative indexing, which allows us to obtain characters starting from the last character at position $-1$ . An overview of the indexing of the entire string:

$\mathtt{\text{"}}$	$\mathtt{\text{H}}$	$\mathtt{\text{e}}$	$\mathtt{\text{l}}$	$\mathtt{\text{l}}$	$\mathtt{\text{o}}$	$\mathtt{\text{,}}$		$\mathtt{\text{w}}$	$\mathtt{\text{o}}$	$\mathtt{\text{r}}$	$\mathtt{\text{l}}$	$\mathtt{\text{d}}$	$\mathtt{\text{!}}$	$\mathtt{\text{"}}$
	0	1	2	3	4	5	6	7	8	9	10	11	12
	-13	-12	-11	-10	-9	-8	-7	-6	-5	-4	-3	-2	-1

As shown above, every character can be reached through two indices: a non-negative index counting from the left, or a negative index counting from the right. Which type of indexing you use depends on the problem at hand.

>>> "Hello, world!"[12]

'!'

>>> "Hello, world!"[-1]

'!'

>>> "Hello, world!"[-2]

'd'
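The relation between the two kinds of index can be verified directly. The sketch below assumes nothing beyond built-in behaviour; the variable names are chosen for illustration. It also shows that an index beyond the last position raises an IndexError.

```python
text = "Hello, world!"
length = len(text)

# Every negative index k refers to the same character as the
# non-negative index length + k.
for k in range(-length, 0):
    assert text[k] == text[length + k]

last = text[-1]                # last character, length not needed
same_last = text[length - 1]   # same character via the positive index
print(last, same_last)

# Indexing past the end of the string is an error.
try:
    text[length]
except IndexError:
    out_of_range = True
else:
    out_of_range = False
print(out_of_range)
```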

We can also index using a slice: a range of indices. This allows us to extract entire substrings at given positions. A slice has the form $\mathtt{\text{[}}\textit{start}:\textit{end}\textit{ \{}:\textit{step\}}\mathtt{\text{]}}$ , where start is inclusive and end is exclusive. step is an optional value that selects only every step-th character within the range $\mathtt{\text{[}}\textit{start}, \textit{end}\mathtt{\text{)}}$ , starting from start. Any of the arguments can be omitted, in which case its default value is used (for a positive step: $\mathtt{\textit{start}\text{ = 0}}$ , $\mathtt{\textit{end}\text{ = }\color{#4271ae}{\text{len}}\text{(}\color{#4271ae}{\text{str}}\text{)}}$ , $\mathtt{\textit{step}\text{ = 1}}$ ).

>>> "Hello, world!"[0:-1]

'Hello, world'

>>> "Hello, world!"[:5]

'Hello'

>>> "Hello, world!"[7:]

'world!'

By omitting start and end you can apply step indexing over the entire string. If you were to specify them you would only apply step indexing within that range.

>>> "abcdefghijklmnopqrstuvwxyz"[::2]

'acegikmoqsuwy'

>>> "abcdefghijklmnopqrstuvwxyz"[0:5:2]

'ace'

A common use of step is a negative value: a step of $-1$ with start and end omitted returns the string in reverse.

>>> "abcdefghijklmnopqrstuvwxyz"[::-1]

'zyxwvutsrqponmlkjihgfedcba'
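As a brief illustration of how slices might be used in practice, the sketch below wraps them in two helpers. Both function names (`every_nth` and `is_palindrome`) are hypothetical and chosen only for this example.

```python
def every_nth(text, step, start=0, end=None):
    """Return the characters of text[start:end], taking every `step`-th one.

    Passing None as `end` behaves like omitting it in the slice.
    """
    return text[start:end:step]


def is_palindrome(text):
    """A string is a palindrome if it equals its own reverse."""
    return text == text[::-1]


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(every_nth(alphabet, 2))
print(every_nth(alphabet, 2, 0, 5))
print(is_palindrome("racecar"))
print(is_palindrome("Hello"))

# Unlike single indices, slice bounds beyond the string do not raise an error.
print("Hello"[2:100])
```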

Capitalization

The $\color{#4271ae} {\mathtt{\text{str}}}$ type offers several methods that make changing the capitalization of (sub)strings easier. The first we'll discuss are the condition-testing methods $\mathtt{\color{#4271ae}{\text{islower}}\text{()}}$ and $\mathtt{\color{#4271ae}{\text{isupper}}\text{()}}$ , which return $\mathtt{\color{#F5871F} {\text{True}}}$ if all letters in the string are lower- or uppercase respectively. Characters without a case, such as digits, punctuation and whitespace, are ignored, but the string must contain at least one letter.

>>> "Hello, world!".islower()

False

>>> "a lowercase string.".islower()

True

>>> "AN UPPERCASE STRING.".isupper()

True

To convert between upper- and lowercase, the $\mathtt{\color{#4271ae}{\text{str}}}$ type offers the methods $\mathtt{\color{#4271ae}{\text{upper}}\text{()}}$ , $\mathtt{\color{#4271ae} {\text{lower}}\text{()}}$ and $\mathtt{\color{#4271ae}{\text{swapcase}}\text{()}}$ . The $\mathtt{\color{#4271ae}{\text{upper}}\text{()}}$ and $\mathtt{\color{#4271ae}{\text{lower}}\text{()}}$ methods convert all the characters in the string to upper- or lowercase respectively, while $\mathtt{\color{#4271ae}{\text{swapcase}}\text{()}}$ swaps the case of all characters.

>>> "Hello, world!".upper()

'HELLO, WORLD!'

>>> "Hello, world!".lower()

'hello, world!'

>>> "Hello, world!".swapcase()

'hELLO, WORLD!'

In addition to methods that swap between upper- and lowercase, the $\color{#4271ae} {\mathtt{\text{str}}}$ type also offers some more specialized capitalization methods, such as $\mathtt{\color{#4271ae} {\text{capitalize}}\text{()}}$ , which capitalizes the first letter of the string and lowercases the rest, and $\mathtt{\color{#4271ae} {\text{title}}\text{()}}$ , which capitalizes the first letter of every word in the string and lowercases the remaining letters.

>>> "a lowercase string.".capitalize()

'A lowercase string.'

>>> "a lowercase string".title()

'A Lowercase String'

Similar to the methods shown above, $\mathtt{\color{#4271ae}{\text{istitle}}\text{()}}$ can be used to test whether a string is title-cased.

>>> "A Lowercase String".istitle()

True
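A typical application of these methods is comparing text regardless of case. The following is a minimal sketch; the helper name `same_ignoring_case` is hypothetical. It uses lower(), which is sufficient for the plain English text used here; for text in other languages, the built-in casefold() method may be the more robust choice.

```python
def same_ignoring_case(first, second):
    """Compare two strings after converting both to lowercase."""
    return first.lower() == second.lower()


print(same_ignoring_case("Hello, World!", "hello, world!"))
print(same_ignoring_case("Hello", "Help"))

# title() treats every run of letters as a word, so the letter
# after an apostrophe is capitalized as well.
print("they're here".title())
```

The last line is worth keeping in mind: title() has no notion of grammar, so its result is not always what a reader would consider proper title case.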

Numerical input

The method $\mathtt{\color{#4271ae}{\text{isdecimal}}\text{()}}$ can be used to test whether a string consists only of decimal digits. This is useful, for example, before converting a $\color{#4271ae}{\mathtt{\text{str}}}$ to a numeric data type.

>>> "10".isdecimal()

True

>>> int("10")

10

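Note that $\mathtt{\color{#4271ae}{\text{isdecimal}}\text{()}}$ returns $\mathtt{\color{#F5871F} {\text{False}}}$ for strings that contain a decimal point, even though they can be converted to $\color{#4271ae}{\mathtt{\text{float}}}$ .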

>>> "3.14".isdecimal()

False

>>> float("3.14")

3.14
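Because isdecimal() also rejects a leading minus sign, an alternative is to simply attempt the conversion and handle the failure. The sketch below shows that approach, which is a different technique from the test-first approach described above; the function name `parse_number` is hypothetical.

```python
def parse_number(text):
    """Return `text` as an int if possible, else as a float, else None.

    Illustrative helper: int() and float() raise ValueError for
    strings they cannot convert, which is used as the test here.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return None


print(parse_number("10"))
print(parse_number("-10"))
print(parse_number("3.14"))
print(parse_number("abc"))
print("-10".isdecimal())
```

Be aware that float() is fairly permissive: it also accepts inputs such as "1e3", so this helper may accept more than you intend for strictly formatted data.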

Comparisons

Individual characters are compared by their Unicode code points. This generally gives the behaviour you would expect for letters of the same case, but the results are less intuitive once special characters or mixed case are involved: the uppercase letters A to Z, for example, all come before the lowercase letters a to z. You can use the $\mathtt{\color{#4271ae}{\text{ord}}\text{()}}$ function to obtain the integer value that is used for the comparison.

>>> ord("a")

97

>>> "a" < "b"

True

>>> "a" < "A"

False

>>> "a" < "!"

False

Since strings are sequences of characters, comparisons between longer strings proceed character by character. The first characters of both strings are compared; if they are the same, the second characters are compared, and so on. As soon as two characters differ, the result of comparing them is returned.

>>> "abc" < "abd"

True

>>> "abcdefg" > "aacdefg"

True

If two strings are the same up to the point where one of them ends, the longer string is considered to be greater. Otherwise, length plays no role: the first differing character decides.

>>> "abcd" > "abc"

True

>>> "abcdefg" > "xyz"

False

A common misconception is that the sums of the Unicode values of all characters in the strings are compared instead. This is not the case.

>>> ord("x") + ord("y") > ord("z")

True

>>> "xy" > "z"

False
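The comparison rules described in this section can be written out by hand. The following sketch is an illustration only, not how Python implements comparison internally; the function name `less_than` is hypothetical. It compares code points position by position and falls back to length only when one string is a prefix of the other.

```python
def less_than(first, second):
    """Mimic `first < second` for strings, one character at a time."""
    for char_a, char_b in zip(first, second):
        if char_a != char_b:
            # The first differing position decides the result.
            return ord(char_a) < ord(char_b)
    # All shared positions are equal: the shorter string is smaller.
    return len(first) < len(second)


pairs = [
    ("abc", "abd"),
    ("abcdefg", "aacdefg"),
    ("abc", "abcd"),
    ("abcdefg", "xyz"),
    ("xy", "z"),
    ("a", "A"),
    ("a", "a"),
]
for first, second in pairs:
    # The hand-written rule agrees with the built-in operator.
    assert less_than(first, second) == (first < second)
    print(first, second, less_than(first, second))
```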