Every once in a while, I’ll come across a well-established library in a programming language that has its quirks. As an instructor, I have to make sure I’m aware of these quirks when I’m teaching. For instance, last time I talked a bit about the various Scanner input methods and how they don’t all behave the same way. Well, today I want to talk about the substring method from Java’s String library.
Documentation
When using a library for the first time, I find it useful to check out the documentation. But with a library so established, it sometimes feels silly to dig into the documentation. After all, a lot of languages support strings. Personally, all I need to know is the name of the command before I can figure out the rest.
However, every once in a while, I’ll come across a function that is less intuitive than I thought. In this case, I’m talking about Java’s substring method. As you can probably imagine, it grabs a substring from a string and returns it. So, what’s the catch?
Well, for starters, the substring method is actually an overloaded method. As a result, there are two different forms of the same method in the documentation. Take a look:
public String substring(int beginIndex)
Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.
Java API, 2019
public String substring(int beginIndex, int endIndex)
Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex.
Java API, 2019
At this point, don’t fixate too much on their descriptions as we’ll get to those. Just be aware that there are two different versions of the same method.
Usage
At this point, I’d like to take a moment to show how to use the substring method. If this is your first time poking around the Java API, this would be a good time to follow along.
First, notice that the method header does not contain the static keyword. In other words, substring is an instance method, which makes sense. We need an instance of a string in order to get a substring:
String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);
In this example, we’ve created two new substrings: one from position 7 to the end and the other from position 0 to position 5. Without looking at the documentation, can you figure out what the resulting strings will be?
Interval Notation
Before I give away the answer, I think it’s important to discuss some terminology from mathematics. In particular, I’d like to talk a bit about interval notation.
In interval notation, the goal is to explicitly state the range of some subset. For instance, we may be interested in all integers greater than 0. In interval notation, that would look something like:
(0, +∞)
In this example, we’ve chosen to exclude the value of 0 from the range using parentheses. We could have just as easily defined the interval starting with 1—pay attention to the brackets:
[1, +∞)
In either case, we’re describing the same set: all integers greater than 0.
So, how does this tie into the substring method? As it turns out, a substring is a subset of a string, so we can use interval notation to define our substring. Why don’t we try a couple of examples? Given “Hello, World!”, determine the substring using the following intervals, where each number is a character index counted from 0:
- [0, 2]
- (0, 5]
- (1, 3)
- (-1, 7]
Once you’re done, check out the answers below:
- “Hel”
- “ello,”
- “l”
- “Hello, W”
We’ll need to keep this idea in the back of our mind moving forward.
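If you'd like to check those answers by machine, the following is a minimal sketch in Java. The slice helper and its parameters are made up for illustration (nothing like it exists in the String API), and it deliberately collects characters one at a time with charAt so that each bound can be marked inclusive or exclusive.

```java
public class IntervalDemo {
    // Hypothetical helper (not part of the Java API): returns the characters of s
    // whose indices fall inside the given interval.
    static String slice(String s, int lower, boolean lowerInclusive,
                        int upper, boolean upperInclusive) {
        // An exclusive bound simply moves one index inward.
        int first = lowerInclusive ? lower : lower + 1;
        int last = upperInclusive ? upper : upper - 1;
        StringBuilder result = new StringBuilder();
        for (int i = first; i <= last; i++) {
            result.append(s.charAt(i));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String str = "Hello, World!";
        System.out.println(slice(str, 0, true, 2, true));    // [0, 2]  -> Hel
        System.out.println(slice(str, 0, false, 5, true));   // (0, 5]  -> ello,
        System.out.println(slice(str, 1, false, 3, false));  // (1, 3)  -> l
        System.out.println(slice(str, -1, false, 7, true));  // (-1, 7] -> Hello, W
    }
}
```

Each call mirrors one interval from the list, in the same order.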
The Truth
The truth of the matter is the substring method is a bit weird. On one hand, we can use a single index to specify the starting point of our new substring. On the other hand, we can use two indices to grab an arbitrary subset of a string.
However, in practice, I find that the second option gives a lot of students trouble, and I don’t blame them. After all, the bounds are deceptive. For example, let’s revisit some code from above:
String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);
Here, we can confidently predict that subOne has a value of “World!”, and we’d be right. After all, index 7 is ‘W’, and the method automatically grabs the rest of the string.
As for subTwo, we’d probably guess “Hello,”, and we’d be incorrect. It’s actually “Hello” because the end index is exclusive (i.e., [0, 5)). In the next section, we’ll take a look at why that is and how I feel about it.
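To see the half-open interval in action, here is a small sketch that prints both substrings along with the character sitting at index 5. The throughIndex helper is hypothetical (it is not part of the String API). It only shows the adjustment needed if you think of the end position as inclusive.

```java
public class SubstringBounds {
    // Hypothetical helper (not part of the String API): returns the characters
    // from index 0 up to and including lastIndex.
    static String throughIndex(String s, int lastIndex) {
        // substring's end index is exclusive, so add 1 to include lastIndex.
        return s.substring(0, lastIndex + 1);
    }

    public static void main(String[] args) {
        String str = "Hello, World!";
        System.out.println(str.substring(7));      // World!
        System.out.println(str.substring(0, 5));   // Hello
        System.out.println(str.charAt(5));         // , (index 5 is left out of the line above)
        System.out.println(throughIndex(str, 5));  // Hello,
    }
}
```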
My Take
From what I understand, the inclusive/exclusive model is the standard for ranges in the Java API. That said, I do occasionally question the design choice.
On one hand, there’s the advantage of being able to use the length of the string as the end point of the substring:
String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
String newtonTheory = jokerQuote.substring(30, jokerQuote.length());
But is this really necessary? Java already provides an overload to the substring method which captures exactly this behavior.
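As a quick sanity check on that claim, this sketch compares the one-argument overload against the two-argument call that passes the string's length as the end index. The sameTail helper is a name chosen for this example only.

```java
public class OverloadEquivalence {
    // Hypothetical helper: true when both overloads produce the same tail of s.
    static boolean sameTail(String s, int beginIndex) {
        String oneArg = s.substring(beginIndex);
        String twoArgs = s.substring(beginIndex, s.length());
        return oneArg.equals(twoArgs);
    }

    public static void main(String[] args) {
        String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
        System.out.println(jokerQuote.substring(30));  // gravity, all it takes is a little push.
        System.out.println(sameTail(jokerQuote, 30));  // true
    }
}
```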
That said, there is a nice mathematical explanation for this notation, and part of it has to do with the difference between the starting and ending points. In particular, we get the length of the new substring:
int length = endIndex - beginIndex;
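A concrete instance of that subtraction, as a minimal sketch: the indices 7 and 12 are picked for illustration, the parameter names follow the API documentation quoted earlier, and substringLength is a made-up helper.

```java
public class SubstringLength {
    // Hypothetical helper: length of the substring selected by [beginIndex, endIndex).
    static int substringLength(String s, int beginIndex, int endIndex) {
        return s.substring(beginIndex, endIndex).length();
    }

    public static void main(String[] args) {
        String str = "Hello, World!";
        int beginIndex = 7;
        int endIndex = 12;
        System.out.println(str.substring(beginIndex, endIndex));        // World
        System.out.println(substringLength(str, beginIndex, endIndex)); // 5
        System.out.println(endIndex - beginIndex);                      // 5
    }
}
```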
In addition, this particular notation allows adjacent substrings to share a midpoint:
String s = "Luck is great, but most of life is hard work.";
String whole = s.substring(0, s.length()/2) + s.substring(s.length()/2, s.length());
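The midpoint isn't special here. As a rough sketch, the check below splits the same sentence at every possible index and confirms that the two halves always glue back into the original. The rejoins helper is a name invented for this example.

```java
public class SplitAndRejoin {
    // Hypothetical helper: splits s at mid and reports whether the two pieces,
    // concatenated, reproduce s exactly.
    static boolean rejoins(String s, int mid) {
        String left = s.substring(0, mid);            // indices [0, mid)
        String right = s.substring(mid, s.length());  // indices [mid, length)
        return (left + right).equals(s);
    }

    public static void main(String[] args) {
        String s = "Luck is great, but most of life is hard work.";
        boolean allRejoin = true;
        // mid == 0 and mid == s.length() are valid too; one side is just empty.
        for (int mid = 0; mid <= s.length(); mid++) {
            allRejoin = allRejoin && rejoins(s, mid);
        }
        System.out.println(allRejoin);  // true
    }
}
```

Because the end index is exclusive, the same number works as the end of the left piece and the start of the right piece, with no plus or minus one in sight.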
Both of these properties are nice, but I think they're likely a byproduct of indexing by zero (perpetuated by Dijkstra), which isn't all that intuitive either. And for those of you who are going to take exception to that comment, be aware that I'm all for indexing by zero and this inclusive/exclusive subset convention.
All I'm trying to say is that I've seen my own students get tripped up over both conventions, so I feel for them in a way. That's why I went to such lengths to write this article in the first place.
Let me know if you feel the same or if I’m totally off base. Otherwise, thanks for taking some time to read my work. I hope you enjoyed it!
Top comments (9)
I find it completely normal. Imagine the following: you want to get 3 characters counting from index 5. That means you want from 5 to 5+3=8.
Also, in many other languages, you either specify the length of the substring or follow the rule explained. Other than that, you usually write for loops as follows:
for (int i=0; i<3; i++), and you already know that i will never be 3.
I totally agree! Both of your examples make perfect sense for people who have coded for a bit. After all, we've all agreed that indices start from 0 (perpetuated by Dijkstra), but that's not intuitive for new folks either.
EDIT: I should clarify that we don't all agree on indexing from 0, but I'd argue that all of the most currently dominant languages index from 0.
We have not agreed to that at all. As in the Dijkstra note you linked, ALGOL and Pascal indices start at 1. This is also the case with XSLT/XPath/XQuery.
I don't understand why you are trying to blame index-0 on Dijkstra. His note was written in 1982, years after languages like C were defined. Dijkstra did not set the rule on where to start. Just because he voiced his reasoned opinion on a subject does not give him the blame for what others did.
Besides, the natural world also starts at 0. When you are born you are in your first year, which is from 0 to 1. This confusing problem is everywhere, not just in programming languages.
Again, I agree all the way that this problem is confusing! The entire point of this article is that certain conventions are not always intuitive. That doesn't mean they're bad. It just means there should be a good reason for them.
Also, I'm not saying that Dijkstra is the reason for indexing from 0, but he's clearly made the strongest case for it. There's been less time between the first programming language and what Dijkstra said (24 years) than what he said and today (37 years). He's had an incredible influence on the field in the last 40 years.