About the Author

me

I am a 24 year old Computer Science student at University of New Hampshire. I'm graduating in May, and currently searching for full time jobs. You can find my resume along with other info about me on my personal page: Daniel P. Noe.

 
-->

26 September 2006 - 21:55Neat tricks with C Arrays and Pointers

In C, arrays are really just pointers with some “syntactic sugar” around them to make access easier. When you use square brackets to access ar[2], ar is a pointer and the compiler translates ar[2] into *(ar + 2). The pointer arithmetic increments the pointer ar by two times the size of an element of ar. Then, it is dereferenced. This has an interesting consequence in that ar[2] == 2[ar]. Both of these allow you to access the the same location in memory and thus the same array member!

Why is this? Remember that a pointer is just an integer value that references a location in memory. So, when you write 2[ar] it translates to *(2 + ar). The address 2, plus the base address of the array. Which is the same as the base address of the array plus two. A really quick snippet of C can verify that this is true:


int main(int argc, char* argv[]) {
    char test[10] = "ABCDEFG";
    if (test[2] == 2[test]) {
        puts("whoa!n");
    }
    return 0;
}

If you compile and run this, it will print “whoa!” as expected, because C == C. But is that really the reason? I compiled this code with gcc -S test.c to get the results (note some code including the string initialization is omitted):


main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movq    .LC0(%rip), %rax
    movq    %rax, -16(%rbp)
    movw    $0, -8(%rbp)
    movl    $.LC1, %edi
    call    puts
    movl    $0, %eax
    leave
    ret

There are no compare or conditional branch instructions in there! That means that without any optimization options turned on GCC has generated code which will always execute the code within the if. This is even without the “constant folding” code which attempts to recognize and optimize constant sections of code. The compiler recognizes that *(ar + 2) = *(2 + r). If the C code is changed slightly so the value is assigned to a temporary variable before the comparison:


char tmp = 2[test];
if (test[2] == tmp) {
    puts("whoa!n");
}

main:
.LFB2:
    pushq   %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    subq    $32, %rsp
.LCFI2:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movq    .LC0(%rip), %rax
    movq    %rax, -16(%rbp)
    movw    $0, -8(%rbp)
    movzbl  -14(%rbp), %eax    /*  Move element 2 to EAX */
    movb    %al, -1(%rbp)        /*  Move low order byte of EAX to temporary stack location */
    movzbl  -14(%rbp), %eax    /*  Move element 2 to EAX */
    cmpb    -1(%rbp), %al        /*  Compare temporary stack location and low order byte of EAX */
    jne .L2                             /*  Jump if not zero (equal) */
    movl    $.LC1, %edi
    call    puts
.L2:
    movl    $0, %eax
    leave
    ret

In this case you can clearly see the array element value being retrieved from the stack and placed into the temporary value (seen here as an offset off the base pointer) and then the identical address array element being retrieved into the EAX register, then compared with the temporary value. Interestingly enough, if I compile the same code with the GCC optimization option -O2 it yields a dramatically less complicated set of assembly instructions:


main:
.LFB24:
    subq    $8, %rsp      /* Allocate memory on the stack? */
.LCFI0:
    movl    $.LC0, %edi   /* Move address of "whoa!n" into %edi */
    call    puts          /* Call puts to output %edi string */
    xorl    %eax, %eax    /* Zero out %eax (return 0) */
    addq    $8, %rsp      /* Unwind the stack pointer */
    ret

Optimization really does improve the code! In this case GCC has once again figured out that the two array references are identical. But it has gone much farther this time and optimized away the temp variable, the comparison, and the array altogether! No memory at all was allocated for the array, and all references to it were gone from the assembly language code. Of course, the compiler is correct. The array has absolutely no purpose in our code given the foregone conclusion of the if block executing. You can even see that GCC has changed to using an xor instruction to zero out %eax (return register) instead of using a mov instruction - which is probably faster (note that anything xor’d with itself is always 0). Optimized code really does make a difference.

I don’t know why the compiler has created the two instructions which manipulate the stack pointer. After some thought, the 8 byte allocation would correspond to a pointer to the string literal at label .LC0. I’m not as familiar with how the assembler treats strings. Obviously 8 bytes does not correspond to the entire string, which was previously confusing me. If you know more, please drop a comment. Thanks to Stu for pointing out the obfuscated array trick.

4 Comments | Tags: computers

25 September 2006 - 22:38Loop Optimization

A common beginner exercise in C is to implement some of the string routines. The strcpy routine takes two arrays of characters (terminated by a “null” character and copies the first to the second. Which of these two implementations is better?


char* strcpy (char* a, char* b) {
    int i;
    for (i = 0; i < strlen(a); i++) {
        b[i] = a[i];
    }
    return b;
}

or


char* strcpy (char* a, char* b) {
    int i;
    int len = strlen(a);
    for (i = 0; i < len; i++) {
         b[i] = a[i];
    }
    return b;
}

If you answered the second one, you’re right. The for loop has three parameters. The first is executed before the start of the loop. The second is executed as a loop conditional before the start of each iteration. The third is executed at the end of each iteration. While the first version above seems to be worse because it uses an extra local variable, strlen() is executed once for every iteration of the loop.

In C, strings are terminated by a special character (null). strlen() works by looking forward through the string until it finds the null character. This means the strlen() needs to perform comparison operations on the order of n, where n is the length of the string. And since the for loop in the code above also executes n times, the number of comparisons in the fast loop will be on the order of n and in the slow loop will be on the order of n squared. This is a significant difference when the string gets large!

Of course, an astute observation is that we don’t need strlen() at all. If you iterate over a and b side by side and check a for null before copying you don’t need another step to determine strlen. This avoids the expensive function call to strlen and the double work involved in traversing the string twice. As compiled by GCC with optimization, the assembly language for this version looks like this:


    movzbl  (%rdi), %eax   /* Move value at address %rdi to %eax */
    testb   %al, %al       /* AND low order byte of %eax with itself */
    je  .L2                /* jump to L2 if result was 0 (null) */
    .p2align 4,,7
.L4:
    movb    %al, (%rsi)    /* Move low order byte of %eax into address %rsi */
    movzbl  1(%rdi), %eax  /* Move value at %rdi + 1 to %eax */
    incq    %rsi           /* Increment %rsi */
    incq    %rdi           /* Increment %rdi */
    testb   %al, %al       /* AND low order byte of %eax with itself */
    jne .L4                /* Jump to L4 if not equal to zero */
.L2:
    movq    %rsi, %rax     /* Move %rsi to %rax (return value) */
    ret                    /* Return to caller */

The loop is now quite compact and involves only six instructions. Furthermore, the instructions are relatively simple (no CALL instructions which are very expensive!). But there is still room for improvement. Conditional branches are expensive. The CPU has prepared to execute instructions in a linear fashion and may already be decoding subsequent instructions when there suddenly needs to be a branch. The hardware in the CPU tries to predict which branches will be taken, but this logic isn’t perfect and the CPU stalls for many cycles when the branch prediction fails and it needs to refill the instruction pipeline. To avoid this, we can unroll the loop by processing several elements in each iteration of the loop. This reduces the total number of branches at the expense of an increase in code size and register usage.

The “real” strcpy from glibc on my workstation is actually hand coded in x86_64 assembly. It uses a number of additional tricks including loop unrolling (performing several copies on each loop iteration) and copying multiple characters at once. Check out the x86_64 glibc version here.

2 Comments | Tags: computers

25 September 2006 - 19:35Gentlemen, keep your engines running

A company named EEStor made big news recently with the announcement of a new ultracapacitor they plan to use in electric cars. The company claims the car will run for 500 miles on around $9 worth of electricity, and will charge in five minutes. There are a few problems with this. The first is that $9 is an absolutely meaningless number from a technical standpoint. Electric rates vary significantly between different regions in the US. However, a conservative assumption is that the company chose the lowest rates found in the US, 5.81 cents per kilowatt-hour (Source: DOE). $9 worth of electrity at $0.0581/Kwh is around 155 kilowatt-hours!

A kilowatt hour represents a quantity of energy equal to a 1000 watts in one hour. To transfer 155 kilowatt-hours worth of energy in 5 minutes would necessitate power consumption of 1.86 megawatts. 1.86 megawatts is a lot of power. A normal household outlet is 120 volts. At this voltage, a power of 1.86 megawatts would represent a current of around 15,500 amps. To put this into perspective, a normal household circuit is between 15 and 30 amps. Even at higher voltage the amp draw is incredibly high. The wiring required to support such an incredible current demand safely would be immense. While the promise is good, the numbers just don’t add up.

4 Comments | Tags: scitech

21 September 2006 - 21:57DRM-Free Classical Music Downloads

The Philadelphia Orchestra has just introduced an online music store. They’ll be selling DRM-free (thats Digital Rights Management - the crap that keeps you from doing what you are legally permitted to do with your purchases) recordings in high-bitrate MP3 and lossless FLAC format. Plus, since it is DRM free the downloads should work on any platform (Win/Mac/Linux/etc) and essentially any player.

As a free trial you can download Beethoven’s Fifth Symphony in MP3 format. Check it out!

No Comments | Tags: general

18 September 2006 - 23:10Cooler Weather Cats

Now that it has cooled off a bit we have removed the AC in the computer room and Katz and Smithers have reclaimed a nightly perch on the window sill. There is a big fan which we often have blowing air out the back door, so a cool breeze comes in the window and they seem to like it.

Katz and Smithers

2 Comments | Tags: photos

14 September 2006 - 21:51Sigma, more graphs, and the UNH ACM

In CS520 today I found out that my assignment (mentioned previously) was actually due not today but on Tuesday. So, there were still questions from other students regarding help with the assignment. One student was storing the graph data as a single array containing the upper right triangle of the matrix, read out from left to right one row after another. He was trying to figure out how to index into this array to get a particular item without using a for loop. It is, of course, possible to do this. However, several people in the class insisted that it could not be done.

For a graph with n vertices, this data structure has n - 1 data points in the first row, n - 2 in the second, and so on. The first step in indexing into the array is to figure out where the row you want starts. You could do this using a for loop which sums up a counter with (n - 1) + (n - 2) + …, but this is very inefficient - the algorithm is O(n) versus a possible O(1) implementation. The hint is of course the summing you need to do.

Gauss is famous for figuring out a shortcut way to sum all the numbers between one and 100 in elementary school. What Gauss figured out is that any time you sum a series of numbers between 1 and n the answer is n(n+1)/2. You can use this fact to figure out the indexing problem. First, separate out the sum of i:

sigma((n-1)-i) = sigma(n-1) - sigma(i)

The sum of (n - 1) over i from 1 to n is just n(n-1). We can use the expansion of the sum of i to determine a result:

n(n-1) - (n(n+1)/2)

Not only is it possible without a for loop, it is a very clean result! Of course, our job isn’t finished - we still need to add an offset for the position within the row. This is left as an exercise (it isn’t hard to do. Just think of the relationship between the number of vertices, row number, and the number of items in the row).

After class I was waiting around because I planned to go to the UNH ACM chapter’s free pizza/learn about Linux event. The instructor asked me about some of the other caching things I’d implemented and suggested that the results of searching the graph are also eligible for caching, a possibility I hadn’t considered. It ends up being a significantly more complicated problem, because if you want to keep track of paths discovered between nodes in a reusable fashion. It also led to him asking me if I knew about UROP. While I’d love to do UROP I’m not sure how well that can coincide with trying to work part time.

The UNH ACM chapter event was also fun and although I certainly didn’t learn anything new about Linux there was a good turnout of people who did and I got to meet some new folks.. and re-meet some people who’ve previously been in my classes. It turns out most of them are seniors, so they are looking for not-seniors to get involved as well. Yet another thing to eat my time!

1 Comment | Tags: scitech, computers

13 September 2006 - 20:38Graph Searching in C

For my CS520 class, we were given an assignment to create a program that searches an unweighted, undirected graph for the possibility of a path between two vertices. The basic purpose of the assignment was to be an intro to C, but it is nonetheless an interesting problem.

In mathematics and computer science, a graph is a set of vertices which are connected via edges. If the edges have weights or costs, the graph is weighted. If the edges are only connected in one direction, the graph is a directed graph. Graphs have some really neat uses is computer science. For example, if you needed to route packets between two nodes on a meshed network, you can use a weighted graph to determine the shortest path between the two nodes. One real world application of this is OSPF, a network routing protocol that uses Dijkstra’s Algorithm to determine the shortest path.

Dijkstra’s algorithm is a way to find the shortest path, but for my assignment we just needed to find any path. The method I chose for this is a depth first search. A depth first search is very easy to think about recursively. Here is an example in pseudocode which works recursively:

dfs(i, j)
  if (i == j)
    return true
  else
    for each unvisited vertex v adjacent to i
      if (dfs(v, j) is true)
        return true
    return false

Note that that code snippet only checks for connectivity and does not return the path used to obtain the destination. The path used can be built “on the way back up” as the stack unwinds. That is, the base case step can return the current position then the iterative case can take on its position at the beginning. Then you will end up with a string or list which contains the nodes visited in order. This example really demonstrates the beauty of recursion. There is no need to keep track of where you are in the graph. It is done implicitly by function calls. The only thing you need to take care of is which nodes are visited which is very simple and cheap to do with, say, a bitmap.

I’m not going to post my code here until the due date is passed and gone. I’m pretty proud of the program itself, however. It is very fast. The slowest part of the program when run on large inputs is the reading of the datafile. For a 40,000 node graph the data file measures 763MB! For this input, the program spends about 1:10 reading in the input and searching the graph. For fun, I created a cache which caches the memory representation of the graph. For the 40,000 node graph, this file is approximately 191MB (pretty close to the memory usage of the program). With the cache file in the OS’s disk cache, the graph is searched in under 2 seconds (even for very long paths). It takes a bit longer if the OS has to fetch the cache file from disk, closer to 10 second. Still, a significant improvement. This also demonstrates just how fast the search algorithm itself is. Even on an incredibly complex graph (a 40,000 node graph contains 799,980,000 unique data points, n(n-1) / 2) the search is very fast. Because the graph data is stored as a matrix of adjacent nodes, setting an edge or getting the status of an edge is a constant time operation. All in all, the program works fairly well.

Of course, those numbers are from running a fairly fast system :)

2 Comments | Tags: computers