C++ Tutorial - String
Strings are a source of confusion because it is not always clear what is meant by string. Is it an ordinary character array accessed through char* (with or without the const), or an instance of class string? In general, we use C-string for the char* or const char* kind, and we use string for objects of the C++ standard library string class.
C++ has two types of string:
- C-style character string
- C++ <string> class which is Standard C++ string.
In C, when we use the term string, we're referring to a variable length array of characters. It has a starting point and ends with a string-termination character. While C++ inherits this data structure from C, it also includes strings as a higher-level abstraction in the standard library.
The primary difference between a string and an array of characters revolves around length. Both representations share the same fact: they represent contiguous areas of memory. However, the length of an array is set at the creation time of the array whereas the length of a string may change during the execution of a program. This difference creates several implications, which we'll explore shortly.
Here is the quick summary:
- "Hello" is a string literal, and we may want to use the following to assign it to a pointer:
const char *pHello = "hello";
We should always declare a pointer to a string literal as const char *.
strlen(pHello) = 5; the terminating null character is not included.
- To store a C++ string as a C-style string:
string str = "hello";
const char *cp = str.c_str();
char *p = const_cast<char *>(str.c_str());
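The following code converts an int to a string, and it can serve as an example of handing the characters of a C++ string back to the caller as a character array.
#include <sstream>
#include <string>
#include <cstring>
using namespace std;

// The caller must supply a buffer s large enough for the digits plus '\0'.
char *int2strB(int n, char *s)
{
    stringstream ss;
    ss << n;
    string cpp_string = ss.str();
    // c_str() points into cpp_string, which is destroyed when the function
    // returns, so the characters are copied into the caller's buffer.
    strcpy(s, cpp_string.c_str());
    return s;
}
In the code, we make a C++ string from the stream (ss.str()) and obtain a C string (const char *) from it using c_str(). That pointer is only valid while cpp_string exists, so casting away the constness (const_cast<char *>) and returning it would leave the caller with a dangling pointer. Instead, we copy the characters into the buffer s and return s.
- An array name behaves like a constant pointer to the first element of the array, and that's why we can't even copy arrays using assignment (arrayA = arrayB is not allowed):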
int a[100];
int b[100];
a = b;                      // error
char cArray[10];
cArray = "Hello!";          // error
const char *cPtr;
cPtr = "Hello!";            // OK

// conversion - array name and pointer
int *iPtr;
iPtr = a;                   // OK
iPtr = &a[0];               // OK
int *iPtr2 = new int[100];
a = iPtr2;                  // error
When we do arrayA = arrayB, the intention is to make arrayA refer to the same area of memory as arrayB. This will be a compile error because we can't change the memory location to which arrayA refers. If we really want to copy arrayB into arrayA, we need to write a loop that does element-by-element assignment, or use a memory library function such as memcpy. Arrays can be implicitly converted to pointers without casting. However, there is no implicit conversion from pointers to arrays.
Though the Standard C library is included as a part of Standard C++, to use the C string functions we need to include the corresponding header file:
#include <cstring>
C string is stored as a character array. Actually, a string is a series of characters stored in consecutive bytes of memory. The idea of a series of characters stored in consecutive bytes implies that we can store a string in an array of char, with each character kept in its own array element.
C-style strings have a special feature: The last character of every string is the null character.
char non_string [10] = {'n','o','n','_','s','t','r','i','n','g'};
char a_string [9] = {'a','_','s','t','r','i','n','g','\0'};
Both of these are arrays of char, but only the second is a string (an array of 9 characters). So, the 8-character string requires a 9-character array. This scheme makes finding the length of the string an O(n) operation instead of O(1) operation.
So, the strlen() must scan through the string until it finds the end. For the same reason that we can't assign one C array to another, we cannot copy C strings using '=' operator. Instead, we generally use the strcpy() function. However, note that it has also long been recommended to avoid standard library functions like strcpy() which are not bounds checked, and can cause buffer overflow.
We manipulate the C string using a pointer. That's why C string is sometimes called pointer-based string. For example,
const char *str ="I am a string";
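We can traverse it one character at a time:
while(*str++){ }
The pointer str is dereferenced, and the character it addresses is tested: any non-null character counts as true, and the terminating null character counts as false, which ends the loop. The ++ then advances str to the next character.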
Actually, a string literal may be used as an initializer in the declaration of either a character array or a variable of type char *. The declarations:
char str[] = "hello";
const char *pStr = "hello";
each initialize a variable to the string "hello". The first declaration creates a six-element array str containing the characters 'h', 'e', 'l', 'l', 'o' and '\0'. The second declaration creates a pointer variable pStr that points to the letter h in the string "hello", which also ends in '\0', in memory.
We cannot modify string literals: an attempt is either rejected at compile time (when the pointer is declared const) or has undefined behavior at run time. String literals are typically placed in read-only program memory. We can, however, modify the contents of an array of characters.
Attempting to modify one of the elements of str is permitted, but attempting to modify one of the characters through pStr is a compile-time error because pStr points to const char. If the pointer were declared as a plain char * instead, the write would compile on older compilers but would typically crash the program at run time.
Just to get the feeling of C string, let's briefly look at the strlen():
size_t strlen ( const char * str );
The length of a C string is determined by the terminating null-character. The length of a C string is the number of characters and the terminating null character is not included. Sometimes, we get confused with the size of the array that holds the string:
char str[50]="0123456789";
defines an array of characters with a size of 50 chars, but the C string, str, has a length of only 10 characters. So, sizeof(str) = 50, but strlen(str) = 10.
String literals have static storage class, which means they exist for the duration of the program. They may or may not be shared if the same string literal is referenced from multiple locations in a program. According to the C++ standard, the effect of attempting to modify a string literal is undefined. Therefore, we should always declare a pointer to a string literal as const char *.
Since there are still many program situations which require understanding C-style string, we need to be familiar with the C-style string.
Let's find out how much we know about C-string using the examples below. Can you figure out what's wrong?
Question 1
char *cstr1 = "hello";
*(cstr1)='t';
Question 2
char *cstr2;
strcpy(cstr2, "hello");
*(cstr2)='t';
Question 3
char cstr3[100];
cstr3 = "hello";
*(cstr3)='t';
Question 4
char cstr4[100] = "hello";
*(cstr4)='t';
Answer 1
Compiles successfully.
In run time, however, at the moment when it tries to write, it fails.
We get this message:
Unhandled exception at 0x00411baa in cstring.exe: 0xC0000005:
Access violation writing location 0x0041783c.
When our program is compiled, the compiler forms the object code file, which contains our machine code and a table of all the string constants declared in the program. The statement
char *cstr1 = "hello";
causes cstr1 to point to the address of the string hello in the string constant table. Since this string is in the string constant table, and therefore technically a part of the executable code, we cannot modify it. We can only point to it and use it in a read-only manner.
The "hello" is a string literal (or string constant) because it is written as a value, not a variable. Even though string literals don't have associated variables, they have the type const char[N] (arrays of constant characters), which converts to const char*. String literals can be assigned to variables, but doing so can be risky. The actual memory associated with a string literal is typically in a read-only part of memory, which is why it is an array of constant characters. This allows the compiler to optimize memory usage by reusing references to equivalent string literals (that is, even if our program uses the string literal "hello" 100 times, the compiler can create just one instance of hello in memory). Older compilers do not, however, force our program to assign a string literal only to a variable of type const char* or const char[]. They accept the assignment to a char* without const, while compilers following C++11 or later warn about it or reject it. Where it is accepted, the program will work fine unless we attempt to change the string, which is what we're trying to do in the last line of Question 1.
Answer 2
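This code compiles successfully. But cstr2 is never initialized, so strcpy() writes through a pointer that does not point to any storage, which is undefined behavior. We need to allocate memory for the character pointer.
The right code should look like this (malloc is declared in <cstdlib>, and the memory should later be released with free):
char *cstr2 = (char*)malloc(strlen("hello")+1);
strcpy(cstr2,"hello");
*(cstr2)='t';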
Answer 3
This won't compile.
At the line:
cstr3 = "hello";
we get the following message:
cannot convert from 'const char [6]' to 'char [100]'
The string hello exists in the constant table, and we can copy it into the array of characters named cstr3 (with strcpy(), for example). However, cstr3 is an array, not a pointer, so the statement
cstr3 = "hello";
will not work.
We can think of the problem this way: the pointer we get from the name of an array as a pointer to its first element is a value NOT a variable, so we cannot assign to it.
In fact, an array name behaves like a constant pointer to the first element of the array. As a consequence of this implicit array-name-to-pointer conversion, we can't even copy arrays using assignment:
int a[100];
int b[100];
...
a = b;   //error
Answer 4
No problem at all.
Summary
In general, the Answers from 1 to 4 may vary depending on the compiler. So the best way to avoid unexpected run-time errors is to use a pointer to const characters when referring to string literals, so that an attempted write is caught at compile time:
const char* ptr = "hello";   // Assign the string literal to a variable.
*ptr = 't';                  // Compile error - attempts to write through a pointer to const
ptr[0] = 't';                // Compile error - expression must be a modifiable lvalue
Standard C library gives us a set of utility functions such as:
/* returns the length of the string */
size_t strlen(const char*);
/* copies the 2nd string into the 1st */
char* strcpy(char*, const char*);
/* compares two strings */
int strcmp(const char*, const char*);
Some of the source code for the library utility functions can be found in Small Programs.
Because we manipulate a C string through a pointer, which is a low-level operation, it's error prone. That's why we have the C++ <string> class.
Though there are lots of advantages of the C++ string over the C string, we won't talk about them at this time. But because a great deal of code still uses C-style strings, we need to look at the conversion between them. Here is an example, with the error messages we'll get if we compile it as it is.
#include <string>
#include <cstring>

int main()
{
    using namespace std;
    string str1;
    const char *pc = "I am just a character array";

    // C++ string type automatically converting a C character string
    // into a string object.
    // string class defines a char* - to-string conversion, which makes
    // it possible to initialize a string object to a C-style string.
    str1 = pc;   //ok

    // error C2440: 'initializing':
    // cannot convert from 'std::string' to 'char *'
    char *p1 = str1;   //not ok

    // error C2440: 'initializing' :
    // cannot convert from 'const char *' to 'char *'
    char *p2 = str1.c_str();   //not there yet

    const char *p3 = str1.c_str();   //ok

    // removing (casting) constantness
    char *p4 = const_cast<char *> (str1.c_str());   // ok

    return 0;
}
The c_str() returns the contents of the string as a C-string. Thus, the '\0' character is appended. Actually, it returns a pointer to const array in order to prevent the array from being directly manipulated. That's why we need const qualifier in the example code. Note that C++ strings do not provide a special meaning for the character '\0', which is used as special character in an ordinary C-string to mark the end of the string. The character '\0' is part of a string just like every other character.
Here are a couple of ways to initialize a string:
string s1;              // Default constructor; s1 is an empty string
string s2(s1);          // Initialize s2 as a copy of s1
string s3("literal");   // Initialize s3 as a copy of a string literal
string s4(n, 'c');      // Initialize s4 with n copies of a character 'c'
The string class provides several constructors. If we define a string type without explicitly initializing it, then default constructor is used.
C++ provides two forms of variable initialization:
int i(256);    // direct initialization
int j = 256;   // copy initialization
Note that initialization and assignment are different operations in C++. When we do initialization, a variable is created and given its initial value. However, when we do assignment, an object's current value is obliterated and replaced with a new one. For a built-in type variable like int, there is little difference between the direct and the copy forms of initialization. However, when we deal with more complex types, the difference becomes clearer. Historically, direct initialization could be more efficient for class types, although modern compilers usually generate the same code for both forms.
- s1=s2
Assign s2 to s1; s2 can be a string or a C-style string.
- s+=a
Add a at end; a can be a character or a C-style string.
- s[i]
subscripting
- s1+s2
Concatenation; the characters in the resulting string will be a copy of those from s1 followed by a copy of those from s2.
- s1<s2
Lexicographical comparison of string values; s1 or s2, but not both, can be a C-style string.
- s1==s2
Comparison of string values; s1 or s2, but not both, can be a C-style string.
- s.size()
Number of characters in s.
- s.length()
Number of characters in s.
- s.c_str()
C-style version of characters in s.
- s.begin()
Iterator to first character.
- s.end()
Iterator to one beyond the end of s.
- s.insert(pos,a)
Insert a before s[pos]; a can be a character, a string, or a C-style string. s expands to make room for the characters from a.
- s.append(a)
Add a at the end of s; a can be a string or a C-style string. s expands to make room for the characters from a.
- s.erase(pos, 1)
Remove the character in s[pos]; s's size decreases by 1. (With a single index argument, s.erase(pos) removes everything from s[pos] to the end of s.)
- pos=s.find(a)
Find a in s; a can be a character, a string, or a C-style string. pos is the index of the first character found, or npos (a position off the end of s). What if we can't find the string, as in the example below?
if(s.find("mystring") == string::npos) cout << "can't find it \n";
Because "mystring" does not exist in s, find() returns a constant which we access with string::npos. As a result, it displays the message. The string::npos is the largest possible value of the string's size type, and it means a position that can't exist. So, it is the perfect return value to indicate the failure of find().
- in >> s
Read a whitespace-separated word into s from in.
- getline(in,s)
Read a line into s from in.
- out << s
Write from s to out.
Take a look at the following code:
char name[12] = "Alan Turing";
std::cout << name << " is one of the greatest.\n";
The name of an array is the address of its first element. The name in the cout is the address of the char element containing the character A. The cout object assumes that the address of a char is the address of a string. So, it prints the character at that address and then continues printing characters until it meets the null character, '\0'.
In other words, a C string is nothing more than a char array. Just as C doesn't track the size of arrays, it doesn't track the size of strings. Instead, the end of the string is marked with a null character, represented in the language as '\0'. So, if we give the cout the address of a character, it prints everything from that character to the first null character that follows it.
The key here is that name acts as the address of a char which implies that we can use a pointer-to-char variable as an argument to cout.
What about the other part of the cout statement?
If name is actually the address of the first character of a string, what is the expression " is one of the greatest.\n"? To be consistent with cout's handling of string output, this quoted string should also be an address. Yes, it is. A quoted string serves as the address of its first element.
It doesn't really send a whole string to cout. It just sends the string address. This means that all of the following are handled equivalently, each being passed along as an address:
- strings in an array
- quoted string constants
- strings described by pointers
The following example shows how we use different forms of strings. It uses two functions from the string library, strlen() and strcpy(). Prototypes of the functions are in cstring header file.
#include <iostream>
#include <cstring>

int main()
{
    using namespace std;

    char nameArr[12] = "Alan Turing";
    const char *namePtrConstChar = "Edsger W. Dijkstra";
    char *ptr;

    cout << nameArr << " and " << namePtrConstChar << endl;
    cout << endl;

    ptr = nameArr;
    cout << "1: " << nameArr << " @ " << (int *)nameArr << endl;
    cout << "1: " << ptr << " @ " << (int *) ptr << endl;
    cout << endl;

    ptr = new char[strlen(nameArr) + 1];
    strcpy(ptr, nameArr);
    cout << "2: " << nameArr << " @ " << (int *)nameArr << endl;
    cout << "2: " << ptr << " @ " << (int *) ptr << endl;
    delete [] ptr;
}
Output from the run:
Alan Turing and Edsger W. Dijkstra

1: Alan Turing @ 0017FF1C
1: Alan Turing @ 0017FF1C

2: Alan Turing @ 0017FF1C
2: Alan Turing @ 007A1F20
The code above creates one char array, nameArr, and two pointer-to-char variables, namePtrConstChar and ptr. The code begins by initializing nameArr to the "Alan Turing" string. Then, it initializes a pointer-to-char to a string:
const char *namePtrConstChar = "Edsger W. Dijkstra";
"Edsger W. Dijkstra" actually represents the address of the string, so this assigns the address of "Edsger W. Dijkstra" to the namePtrConstChar pointer.
String literals are constants, which is why the code uses the const keyword. Using const means we can use namePtrConstChar to access the string but not to change it.
The pointer ptr is left uninitialized at first, so it doesn't yet point to any string.
The code illustrates that we can use the array name nameArr and the pointer namePtrConstChar equivalently with cout. Both are the addresses of strings, and cout displays the two strings stored at those addresses.
Let's look at the following code of the example:
cout << "1: " << nameArr << " @ " << (int *)nameArr << endl;
cout << "1: " << ptr << " @ " << (int *) ptr << endl;
It produces the following output:
1: Alan Turing @ 0017FF1C
1: Alan Turing @ 0017FF1C
In general, if we give cout a pointer, it prints an address. But if the pointer is type char *, cout displays the pointed-to-string. If we want to see the address of the string, we should cast the pointer to another pointer type, such as int *. Thus, ptr displays as the string "Alan Turing", but (int *)ptr displays as the address where the string is located. Note that assigning nameArr to ptr does not copy the string, it copies the address. This results in the two pointers (nameArr and ptr) to the same memory location and string.
To get a copy of a string, we need to allocate memory to hold the string. We can do this by:
- declaring a second array
- using new
In the code, we use the second approach:
ptr = new char[strlen(nameArr) + 1];
Then, we copy a string from the nameArr to the newly allocated space. It doesn't work if we assign nameArr to ptr because it just changes the address stored in ptr and thus loses the information of the address of memory we just allocated. Instead, we need to use the strcpy():
strcpy(ptr, nameArr);
The strcpy() function takes two arguments. The first is the destination address, and the second is the address of the string to be copied. Note that by using strcpy() and new, we get two separate copies of "Alan Turing":
2: Alan Turing @ 0017FF1C
2: Alan Turing @ 007A1F20
Additional code samples related to string manipulation, which frequently appear at interviews, are in sources A and sources B.