About the Author

I am a 24-year-old Computer Science student at the University of New Hampshire. I'm graduating in May and am currently searching for full-time jobs. You can find my resume along with other info about me on my personal page: Daniel P. Noe.

 

23 March 2008 - 11:12
Implementing offsetof() as a macro

File this under neat/stupid C tricks:


/* From include/linux/stddef.h in the Linux kernel source */
#ifdef __compiler_offsetof
#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
#else
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#endif

The offsetof(t,m) macro is supposed to return the offset, in bytes, of member m within type (structure) t. This is useful for a variety of things, particularly finding the address of the beginning of a structure given the address of any member in it.

Some compilers implement offsetof() natively, and the code above uses the compiler’s version when it exists. When it doesn’t, the macro falls back on the pointer trick in the #else branch. It takes address 0 and casts it to a pointer to whatever structure we care about. It then uses the structure member operator -> to name the member and applies the address-of operator to it. Because the structure is imagined to begin at address zero, the member’s address is also its offset from the beginning of the structure. Finally, the macro casts this value to size_t, the integer type used for sizes and offsets. Note that while this code appears to dereference a null (0) pointer, it only takes the address of the result, so in practice no memory is read at runtime. Strictly speaking, the C standard does not promise that this works, which is a good reason to prefer the compiler’s built-in when one is available.
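To see the trick in action, here is a minimal sketch that compares a hand-rolled version of the macro with the offsetof() your compiler ships in <stddef.h>. The names MY_OFFSETOF, struct sample, and my_offsetof_matches() are made up for this example (the macro is renamed so it does not clash with the real one), and the actual offsets depend on your platform’s alignment rules.

```c
#include <assert.h>
#include <stddef.h>   /* size_t and the standard offsetof, used for comparison */

/* Hypothetical name: same body as the kernel's fallback, renamed to avoid
 * redefining the standard offsetof. */
#define MY_OFFSETOF(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

/* Hypothetical example structure; padding between members is up to the ABI. */
struct sample {
    char tag;
    int count;
    double value;
};

/* Returns 1 if the hand-rolled macro agrees with the compiler's offsetof
 * for every member of struct sample, 0 otherwise. */
int my_offsetof_matches(void)
{
    return MY_OFFSETOF(struct sample, tag) == offsetof(struct sample, tag)
        && MY_OFFSETOF(struct sample, count) == offsetof(struct sample, count)
        && MY_OFFSETOF(struct sample, value) == offsetof(struct sample, value);
}
```

Calling assert(my_offsetof_matches()) from your own main() should pass on the usual compilers. The first member of a structure always sits at offset 0; the other offsets vary with padding.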

The Linux kernel uses this in its linked list implementation. The doubly linked list implementation uses a structure called list_head which contains forward and backward pointers. To use the linked list, you simply include a list_head as a member anywhere in whatever structure you want to use as the data associated with the list. You can even include it multiple times, so the same data can be a member of multiple lists. You manipulate the list using the list_head structures, and when you want a pointer to the data you use the list_entry() macro. It uses offsetof() to find how far the list_head sits from the start of the data structure, then subtracts that offset from the list_head’s address to yield the beginning of the structure. Neat and elegant!
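Here is a stripped-down sketch of that idea in plain userspace C. It is not the kernel’s code: the list_entry() below is a simplified stand-in (the real macro adds type checking), and struct job, list_init(), list_add_tail(), and sum_job_ids() are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Simplified list_entry(): step back from the embedded list_head to the
 * start of the enclosing structure. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical payload; the list_head is deliberately not the first member. */
struct job {
    int id;
    struct list_head link;
};

/* Make head an empty circular list that points at itself. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert new_node just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *new_node, struct list_head *head)
{
    new_node->prev = head->prev;
    new_node->next = head;
    head->prev->next = new_node;
    head->prev = new_node;
}

/* Walk the list_head links and recover each enclosing struct job. */
int sum_job_ids(struct list_head *head)
{
    struct list_head *pos;
    int total = 0;

    for (pos = head->next; pos != head; pos = pos->next)
        total += list_entry(pos, struct job, link)->id;
    return total;
}
```

The list code never needs to know what a struct job is. Only the caller, who knows the type and member name, converts a list_head pointer back into a pointer to the data.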

For details about the kernel linked list, check out Linux Kernel Linked List Explained, which does a good job of describing the various functions that allow easy linked list manipulation.

Tags: computers

22 March 2008 - 20:13
Troubleshooting Defunct (Zombie) Processes on Linux

At work (LNI) we have a tool which implements a remotely controlled RDMA agent using the OpenFabrics interface. The agent is used for compliance, interoperability, and performance testing. For quite some time we’ve had a problem where the agent sometimes hangs, and after we send it SIGINT (ctrl-C or kill) it shows up as “defunct”, with a zombie (Z) state, in the output of ps.

Normally a zombie process means that the process has died but remains in the process table because the parent hasn’t called wait() to “reap” the process and retrieve its exit status. If you kill the parent, the zombie is reparented to init (process 1) and init reaps it. But the problem we were having was clearly not this. Killing the parent did nothing, and the defunct agent appeared to be holding onto resources: it was not possible to start a new agent since the defunct one continued to hold a socket open, which would never be true with the usual meaning of a defunct process. The only workaround was to reboot the system.
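To make the normal case concrete, here is a small C sketch of the usual life cycle: a child exits, sits briefly as a zombie, and disappears once the parent reaps it. The function name spawn_and_reap() is made up for this example, and error handling is kept to a minimum.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with exit_code, then reap it.
 * Between the child's exit and the waitpid() call below, the child is a
 * zombie: dead, but still in the process table holding its exit status.
 * Returns the child's exit status, or -1 on error. */
int spawn_and_reap(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0)
        _exit(exit_code);       /* child: terminate right away */

    /* Parent: waitpid() collects the status and frees the table entry. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the parent slept for a while before calling waitpid(), ps would list the child as defunct during that window. That is the harmless kind of zombie, unlike the one described in this post.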

The key to the real answer is that the agent uses multiple threads with the POSIX threads library. Under Linux, individual threads and processes are both viewed as tasks by the process management code. Threads are implemented using one task designated as the “thread group leader” and a “thread group id” present in each task_struct. By default, the ps utility displays just the thread group leader, hiding the other tasks.

Working at the OpenFabrics Alliance interoperability event last week at UNH-IOL, we once again experienced the defunct problem. This time I attacked it in earnest and discovered the “ps aumx” options (“m” being the critical one, which displays threads individually). This showed the thread group leader in “Z” state, and then the key: another thread was stuck in the “D” uninterruptible sleep state. In this state the task is blocked in kernel mode and cannot be interrupted by any signal, including SIGKILL (signal 9, which normally cannot be ignored). Thus, this thread is unkillable until whatever condition it is waiting for in kernel mode is cleared. If that never happens, the only solution is to reboot the system.

When the SIGINT (or SIGKILL or whatever) was delivered to the agent, all the tasks (threads) received the signal. Yet one couldn’t exit because it was in “D”, so the thread group leader remained in the deceptive “Z” state, in this case indicating that the process was still around. This was confirmed by running ps aumx on the agent after a suspected hang but before attempting to kill it. This time there were the usual numerous threads, one of which was listed in state “D”.
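If you want to check for this from a program rather than with ps, the per-thread state letters are available under /proc. Below is a minimal sketch in C, not a copy of what ps does internally, that reads /proc/<pid>/task/<tid>/stat for each task of a process and prints its state. The function names parse_task_state() and print_thread_states() are my own, invented for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the one-letter state from the contents of a /proc stat file.
 * The format is "pid (comm) state ..."; comm may itself contain spaces or
 * parentheses, so search for the last ')' rather than the first.
 * Returns '?' if the line cannot be parsed. */
char parse_task_state(const char *stat_line)
{
    const char *paren = strrchr(stat_line, ')');

    if (paren == NULL || paren[1] != ' ' || paren[2] == '\0')
        return '?';
    return paren[2];
}

/* Print the state of every task (thread) of process pid, one per line.
 * Returns the number of tasks listed, or -1 if the process cannot be read. */
int print_thread_states(int pid)
{
    char path[320], line[512];
    struct dirent *ent;
    DIR *dir;
    int count = 0;

    snprintf(path, sizeof path, "/proc/%d/task", pid);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        FILE *fp;

        if (ent->d_name[0] == '.')
            continue;           /* skip "." and ".." */
        snprintf(path, sizeof path, "/proc/%d/task/%s/stat", pid, ent->d_name);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;           /* task went away while we were scanning */
        if (fgets(line, sizeof line, fp) != NULL) {
            printf("tid %s state %c\n", ent->d_name, parse_task_state(line));
            count++;
        }
        fclose(fp);
    }
    closedir(dir);
    return count;
}
```

In a situation like ours, a listing of this kind would show the thread group leader as Z with one other task as D, which is the combination to look for when a “defunct” process refuses to go away.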

So how to debug from here? It was possible to use the Magic SysRq Key to obtain a listing of the current tasks on the system. This displayed a stack trace showing the execution context of each task running in kernel mode, including the agent task stuck in uninterruptible sleep. Using this it was possible to determine what the task was doing which caused it to slip into a coma.

I wanted to make sure this got written up because I spent way too much time looking on Google and finding pages that described the usual meaning of “defunct” processes but didn’t touch on this deceptive alternate meaning. So hopefully now it’ll be found! Extra thanks to Professor Robert Russell, who helped me troubleshoot all of this.

Tags: computers

11 March 2008 - 0:29
A short introduction to RDMA

This week I am attending the InfiniBand Trade Association plugfest at UNH-IOL and next week I’ll be there again for the OpenFabrics Alliance interoperability event. The OFA is an industry alliance representing InfiniBand and iWARP, both high performance networking technologies that allow remote DMA (Direct Memory Access). For my readers who don’t know what RDMA is, I’ll try to explain.

First, a bit of background about DMA. There are many different ways to handle IO. Initially, the CPU was involved in copying data between a device and memory. This means that the CPU can’t be used for anything else during the copy. Since IO devices are significantly slower than the CPU (generally by orders of magnitude, especially for disks), the CPU is essentially wasted on such a simple task. DMA allows a separate controller to take charge of the transfer between the IO device and main memory. The CPU initiates a transfer, and the DMA controller does the actual work, freeing the CPU for other tasks. When the transfer is completed, the controller sends an interrupt to the CPU, which alerts the operating system that the transfer is done.

In an ideal world the operating system has a number of tasks which can run on the CPU. An ideal workload balances tasks which require a lot of IO against tasks which are bound by the CPU (for example, many calculations with little IO). By using DMA the system can switch to CPU-bound tasks and run them while the slow data transfer happens. Additionally, because the system is not overloaded with constant interrupts (as it would be with a byte-by-byte non-DMA transfer), interactive events - such as a keypress being registered - are handled more promptly.

RDMA - Remote DMA - extends this concept to networking. Networking tends to involve a lot of data copying, and a lot of work by the operating system. Many network cards support DMA already, but this only allows a DMA transfer of raw network data from the card into the OS kernel’s memory. This data must be decoded as it moves through the protocol stack, which typically involves additional copying, and eventually it is delivered to a user space program (which involves yet another copy). All this copying is particularly necessary since the kernel must virtualize the shared network interface to all the programs running on a system. The kernel needs to ensure that a user program cannot read network traffic meant for a different program. As network interfaces keep getting faster, this overhead becomes more and more significant.

RDMA gives increased performance at the expense of security. Special network interface cards implement the protocol stack in silicon on the card. A user program can then initiate a remote direct memory-to-memory transfer. The operating system kernel is not involved in the transfer at all. This means the CPU can be busy doing computations while the (comparatively) much slower network link transfers data. This is especially useful for high performance computing. If a parallel task involves a lot of computation with message passing to synchronize, then things can be drastically sped up by using RDMA.

InfiniBand is one technology that implements RDMA. IB uses a high performance dedicated network. It is commonly used for computing clusters, since it uses dedicated InfiniBand switch hardware. iWARP is a second RDMA technology, which allows RDMA transfers over standard Internet Protocol links (such as a normal local area Ethernet network). It still requires special iWARP “RNIC”s which contain a hardware network stack and enable RDMA operations. But there is no need for special switch hardware, and iWARP can even be used over a wide area.

Obviously this technology is not going to show up in your home any time soon. The target markets are high performance computing, storage, and finance. Nonetheless, within this limited sector RDMA is a very exciting development.

Tags: computers

5 March 2008 - 21:12
AOL “opens” AIM Protocol

For many years now the AOL Instant Messenger protocol OSCAR has been reverse engineered, enabling third party access to the AIM network. Open source libraries such as libpurple package this functionality, and are used by the Pidgin and Adium clients amongst others.

Recently, AOL has announced an OpenAIM initiative, billed as opening the network for third party clients. But check out the license agreement:

Additional Feature Requirements. Any Custom Client or Web AIM Developer Application that you distribute must include at least two of the following features or functionalities (“Additional Features”) as an integral part of such distributed Developer Application:

1. AIM Expressions. Inclusion of the capability for your users to choose and display a Buddy Icon to customize his or her user experience and provide a link to the AOL-Hosted AIM Expressions web page as documented in the AOL Additional Features document.
2. AIM Toolbar. Inclusion of the AIM Toolbar as a user-selected option during the registration/download/installation process for the Developer Application, as applicable.
3. AIM Start Page Launch. Inclusion of the launch of the AIM Start Page upon users’ logon to your Site or to the Developer Application.
4. Buddy Info. Inclusion of content provided by AOL that includes information about a user’s online status, including the user’s AIM profile, and AOL-supplied advertising.
5. Advertisement. Inclusion of an AOL-provided display advertisement (“Advertisement”) within your Custom Client, Site or activity window. Unless otherwise provided in a written agreement, all revenue from such Advertisement will belong to AOL.

It isn’t open at all. You need to include specific functions. And even worse, if your client is used by more than 100,000 people, as determined by AOL, you must include advertising supplied by AOL. I’m glad that AOL has decided to get onto the “open” bandwagon but this is a PR move - the protocol hasn’t really been opened at all. In fact, it appears that this may even be a ploy to go after open source libraries like libpurple which currently use the reverse engineered OSCAR specification.

Google Talk is based on the open XMPP messaging protocol, also known as Jabber. It provides a true open interface, and interoperates freely with XMPP/Jabber servers operated by independent individuals (like myself). You can view the specification online without the stipulation of a license agreement. XMPP is the truly open instant message framework.

You can send me an instant message at spinfire@isomerica.net, and if you want to join XMPP users, register your own screen name at isomerica.net or Google Talk or any of the open Jabber/XMPP operators. Give it a try in the name of open specifications and interoperability!

Tags: computers

2 March 2008 - 21:05
Some photos from the weekend

On Saturday morning we had some snow. Much to my surprise, I looked out and saw a whole clan of robins hanging out in the trees near our apartment despite the falling snow. They stayed for a short while, then left again. What were they thinking?!


Robins in the snow

EF 70-200 f/2.8L IS with the EF 1.4x Extender @280mm. 1/250, f/4.0, ISO 200.

We went down to Acton, MA to play 1856 (an 18xx board game). We use a moderator - software which automates some of the housekeeping for the game so play doesn’t take so long. The game took around 6 hours even with the moderator. Lucky manned the moderator for most of the game (in addition to playing):


Lucky

EF 70-200 f/2.8L IS @70mm. 1/80, f/6.3, ISO 100. With Canon 580EX II flash and Demb diffuser. Desaturated and vignetted in Lightroom for an antique look.

We played in teams in order to keep the number of players down (and make things go faster). So there was some whispering between teammates Alan and Shawn:


Alan and Shawn

EF 70-200 f/2.8L IS @70mm. 1/80, f/5.6, ISO 100. With 580EX II flash and Demb flash diffuser, with the flash head pointing up and to the right at an angle.

Tags: photos