Computer Science 010

Lecture Notes 4

Strings

Walking through arrays using pointer arithmetic

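When we use a subscript such as a[i], the C compiler generates executable code that multiplies the index i by the size of one array element, adds the result to the address where the array begins, and then reads or writes the element at that address.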

All languages that use array subscripts do the same thing (except some languages will check that the array index is within bounds as well!).

C gets a little funky in that it exposes to the programmer the fact that array elements can be found simply by doing "pointer arithmetic" to get a pointer into the middle of the array!
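As a minimal sketch of that equivalence: for any array a and index i, a[i] means exactly *(a + i). The helper name elem_at is hypothetical and used only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: fetch element i of an int array using pointer
   arithmetic instead of a subscript.  a[i] means exactly *(a + i). */
int elem_at (const int *a, size_t i) {
  return *(a + i);
}
```

Calling elem_at (v, 2) on an array v gives the same value as v[2].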

To see how that works, let's rewrite minarray using pointer arithmetic:

int minarray (int *a, int size) {
  /* The minimum value found.  */
  int min;
   
  /* The address of the last element in the array */
  int *lastElem = a + size - 1;
   
  /* Walk the array. */
  for (min = *a; a <= lastElem; a++) {
    /* If the next value is smaller than the current minimum, remember
       it. */
    if (*a < min) {
      min = *a;
    }
  }
   
  /* Return the minimum value. */
  return min;
}
   

In this example, there is no subscripting. Instead, we walk through the array by updating the pointer value. Before entering the loop we find the address of the last element of the array, which is located at the original address of the array plus (size - 1). This might not seem right. ints on our Unix machines occupy 4 bytes, so it would seem that we should multiply size - 1 by 4 before adding it to a. It turns out that when we do pointer arithmetic, C does this multiplication for us, so we don't need to worry about how big the array elements are. It knows that a is a pointer to an int, so it multiplies size - 1 by the size of an int (4 bytes here) before adding it to a. The effect is that we land size - 1 ints ahead in memory instead of size - 1 bytes ahead.

Similarly, when we walk through the array, we can just increment a using the ++ operator. This does not increase the address stored in a by 1, but rather by the size of an int (because a is an int *). If a were a different type of pointer, it would be incremented by the size of that type instead. (For example, a char * is incremented by 1, since a char occupies 1 byte.)
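The scaling can be observed directly by measuring, in bytes, how far a pointer moves. This is a sketch only; the helper names int_step_bytes and char_step_bytes are hypothetical, and the caller is assumed to pass a pointer into an array with at least n elements.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes between p and p + n when p is
   an int pointer.  Casting to char * lets us count individual bytes. */
size_t int_step_bytes (int *p, size_t n) {
  return (size_t) ((char *) (p + n) - (char *) p);
}

/* The same measurement for a char pointer. */
size_t char_step_bytes (char *p, size_t n) {
  return (size_t) ((p + n) - p);
}
```

For an int array, int_step_bytes (v, 3) equals 3 * sizeof (int), which is 12 on a machine with 4-byte ints. For a char array, char_step_bytes (c, 3) is 3.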

Strings

There is no String type in C as there is in Java. Instead, strings are declared as "char *". Recall that this means "pointer to char" but due to the similarity of pointers and arrays, it also means "array of characters". In fact, strings are normally declared as pointers to characters but really it makes more sense to think of them as arrays of characters! You can reference individual characters in a string using array subscripts.

char *name = "Williams";

name[0] is the character 'W'. This string actually requires an array with 9 characters even though "Williams" is only 8 characters long. The last array element contains a special null character that signals the end of the string. The null character can be written as a character constant as '\0'. When you use a string constant as above, C automatically adds the null character. The equivalent array value is:

char college[] = {'W','i','l','l','i','a','m','s','\0'};
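The extra element can be checked by comparing sizeof, which counts every array element, with strlen, which stops at the null character. This is a small sketch; the array name word and the three helper functions are hypothetical.

```c
#include <string.h>

/* A string constant initializer adds the null character for us. */
static const char word[] = "Williams";

/* Number of array elements, counting the null character. */
size_t word_array_size (void) {
  return sizeof word;
}

/* Number of characters before the null character. */
size_t word_length (void) {
  return strlen (word);
}

/* 1 if the last array element is the null character, else 0. */
int word_ends_with_null (void) {
  return word[sizeof word - 1] == '\0';
}
```

Here word_array_size () is 9 while word_length () is 8.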

The reason strings are normally declared as pointers is that we typically do not know how big a string will be when we declare the string variable. In fact, we probably will want its size to change over time. C requires us to declare the size of array variables when we declare them so this is inappropriate. Instead, we declare a pointer and then dynamically allocate memory when we know how big the string will be. More on dynamic memory allocation another day...

To input or output a string, use the %s conversion specification in printf and scanf:

char *msg = "Wombat!\n";
printf ("%s", msg);

For scanf, you need to be sure that there is memory to read the string into. In this case, you must declare a string array with a size:

char input[100];
/* The width 99 leaves room for the terminating null character. */
scanf ("%99s", input);

Notice that this time you do not need an & before input. When an array name is used as a function argument, C converts it to a pointer to the array's first element. Its value is therefore already an address, so you do not need the & address-of operator.
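On its own, %s puts no limit on how many characters are stored, so a field width is a sensible safeguard. The sketch below uses sscanf, which follows the same conversion rules as scanf but reads from a string, so the example needs no keyboard input. The helper name first_word is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: store the first whitespace-delimited word of
   line in word, which must have room for at least 100 chars.
   Returns 1 if a word was read, 0 otherwise. */
int first_word (const char *line, char *word) {
  /* %99s stores at most 99 characters plus the terminating null. */
  return sscanf (line, "%99s", word) == 1;
}
```

Given the line "  hello world", first_word stores "hello": %s skips leading whitespace and stops at the next whitespace character.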

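Another function for inputting strings is gets:

char *gets (char *s)

gets reads an entire line into s, discards the newline, and returns s (or NULL at end of file). scanf with %s, on the other hand, reads a single word: it skips leading whitespace and stops at the next whitespace character, so whitespace never appears in the value it stores. Be careful with gets: it has no way to know how big the array s is, so a long input line overruns the array. For that reason it was removed from the C standard in C11. fgets, which takes the array size as a parameter, is the safe replacement.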
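Because gets cannot be told how large its array is, a bounded way to read a line is fgets from stdio.h. Here is a minimal sketch, assuming a hypothetical helper named read_line. Unlike gets, fgets keeps the newline, so the helper removes it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from in into buf, which holds
   size chars.  Returns buf, or NULL at end of file or on error. */
char *read_line (char *buf, int size, FILE *in) {
  /* fgets stores at most size - 1 characters plus the null. */
  if (fgets (buf, size, in) == NULL) {
    return NULL;
  }
  /* Replace the newline, if one was read, with the null character. */
  buf[strcspn (buf, "\n")] = '\0';
  return buf;
}
```

A typical call is read_line (input, sizeof input, stdin), where input is a char array.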

Sample Program

Here is one way that the C library might implement strcpy:

char *strcpy (char *dst, const char *src) {
  int i = 0;
   
  for (i = 0; src[i] != '\0'; i++) {
    dst[i] = src[i];
  }
  dst[i] = '\0';
  return dst;
}

Here is another way to write the same function using pointer arithmetic instead of subscripting:

char *strcpy (char *dst, const char *src) {
  char *retval = dst;
  while (*dst++ = *src++) ;
  return retval;
}

All the action happens in the condition of the while loop! First the value pointed to by src is copied to the current dst address. Then both the src and dst addresses are updated to point to the next char. The assignment expression itself has a value, namely the value assigned. A non-zero value means true, so as long as the value that was assigned was not the terminating null character (whose value is 0), the loop continues. Eventually, the null character is assigned, the loop terminates, and a pointer to the start of the copy is returned.
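To make the steps hidden in that one-line loop explicit, here is a sketch that spells each of them out. The name copy_expanded is hypothetical, chosen to avoid clashing with the library strcpy. It is meant to behave the same way as the compact version.

```c
/* Hypothetical, step-by-step equivalent of
   while (*dst++ = *src++) ;  */
char *copy_expanded (char *dst, const char *src) {
  char *retval = dst;
  char c;

  do {
    c = *src;    /* Fetch the char that src points to. */
    *dst = c;    /* Store it at the current dst address. */
    src++;       /* Advance both pointers to the next char. */
    dst++;
  } while (c != '\0');   /* Stop once the null has been copied. */

  return retval;
}
```

Note that the null character is copied before the test, which is why no separate assignment of '\0' is needed after the loop.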

Which is better? I definitely prefer the first version. It is far easier to understand and thus also far easier to get right. The second one, however, is a good demonstration of the low-level programming possible in C.

String functions

Besides I/O, most string manipulation is done using library functions. To use any of the functions below, you must include the string.h file at the beginning of your program:

#include <string.h>

Here are the most common functions:

size_t strlen (const char *s)

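This returns the length of s, that is, the number of characters up to, but not including, the terminating null character. It is not the size of the array used by s: the array must be at least one element larger to hold the null character, and is often larger still. The const syntax indicates that s is not changed in the body of strlen. size_t is an unsigned integer type. On our machines it is defined to mean "unsigned int", but other systems may use a wider type such as unsigned long.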
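One way strlen might be written, in the pointer-walking style used earlier in these notes, is sketched below. The name my_strlen is hypothetical, and real library versions are usually more heavily optimized.

```c
#include <stddef.h>

/* Hypothetical sketch of strlen: walk to the null character and
   return how many chars were passed along the way. */
size_t my_strlen (const char *s) {
  const char *p = s;

  while (*p != '\0') {
    p++;
  }
  return (size_t) (p - s);
}
```

For an array declared as char buf[100] = "Wombat";, my_strlen (buf) is 6 even though sizeof buf is 100.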

char *strcpy (char *dst, const char *src)

This copies the string src to the string dst. dst must be an array large enough to hold src, including its terminating null character. dst is also the return value. This differs from assignment because it makes a copy. If you use an assignment like dst = src;, dst and src point to the same chunk of memory.
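The difference between copying and assigning shows up as soon as the original string is modified. This is a small sketch; the function name copy_is_independent and its local variables are hypothetical.

```c
#include <string.h>

/* Returns 1 if a change to the original is visible through a pointer
   assigned from it but not through a strcpy'd copy. */
int copy_is_independent (void) {
  char original[] = "cat";
  char copy[sizeof original];   /* Room for "cat" plus the null. */
  char *alias = original;       /* Assignment: same memory. */

  strcpy (copy, original);      /* Copy: separate memory. */
  original[0] = 'b';            /* original is now "bat". */

  return alias[0] == 'b' && copy[0] == 'c';
}
```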

int strcmp (const char *s1, const char *s2)

This compares s1 to s2. If s1 alphabetizes before s2, strcmp returns a negative value. If s1 and s2 have the same string values, strcmp returns 0. If s2 alphabetizes before s1, a positive value is returned. strcmp is different from using the == operator, just as .equals and == have different meanings in Java. == just compares the pointers, while strcmp compares the values pointed to. Note that this comparison is case sensitive. In the ASCII character set used on our machines, all capital letters precede all lowercase letters in this ordering.
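Only the sign of the result is meaningful, so a sketch like the following reduces it to -1, 0, or 1. The helper name compare_sign is hypothetical, and the uppercase example assumes an ASCII machine.

```c
#include <string.h>

/* Hypothetical helper: -1 if s1 orders before s2, 0 if they are
   equal, 1 if s1 orders after s2. */
int compare_sign (const char *s1, const char *s2) {
  int r = strcmp (s1, s2);

  return (r > 0) - (r < 0);
}
```

With this helper, compare_sign ("apple", "banana") is -1 and compare_sign ("same", "same") is 0. compare_sign ("Zebra", "apple") is also -1 in ASCII, because 'Z' precedes 'a'.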

Be careful when you use strcmp. Remember that it really returns an integer, not a boolean, not even an integer pretending to be a boolean. If the two strings are equivalent, strcmp returns 0 which evaluates to false, not true as a boolean! Don't say:

if (strcmp (str1, str2)) {
   ....
}

if you mean to enter the if-statement when the strings are the same!
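A safer habit is to compare the result with 0 explicitly. One way to do that is to wrap the test in a small function, sketched here under the hypothetical name same_string.

```c
#include <string.h>

/* Hypothetical helper: 1 if the two strings have the same contents,
   0 otherwise. */
int same_string (const char *str1, const char *str2) {
  return strcmp (str1, str2) == 0;
}
```

Writing if (same_string (str1, str2)) then reads the way it behaves.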

You can get more information about any of these functions and other string functions by looking them up in section 3 of xman. (Click on the manual page button of xman. Pull down the Sections command and select (3) Subroutines.) Section 3 contains functions that can be called from C programs. One important piece of information on the manual pages for functions is an indication of which .h files you must include in your program. string.h includes prototypes for these three functions and others. It also includes the definition of size_t, the type returned by strlen. On our machines, size_t is defined as follows (other systems may use a different unsigned integer type):

typedef unsigned int size_t;

Unfortunately, the man page does not tell you this, so you may be uncertain what you can do with the return value of strlen. The files listed on man pages reside in the /usr/include directory. You can view these files with more or emacs to find the definition of types such as size_t when the manual page is insufficient.

