I had considered putting a blog in this space, but decided that that would be too structured and that I would be obligated to update it regularly. I do occasionally have to urge to write something down or share a great recipe. This page won't be updated regularly but it should fill with interesting tidbits eventually. There will almost certainly not be any overarching theme, and I reserve the right to post about personal, culinary, philosophical, and highly technical (research related) issues.
In performance computing, there are times when the size of the problem you can treat is limited not by computing speed, but by your machine's total memory. This is the case for parts of the map making code I am developing. For linear algebra, I'm using AMD graphics cards and the acml-gpu libraries to speed things up, and am now completely limited by the amount of memory on the machine. I want to hold off working on a cluster as long as possible because waiting in a queue and transferring data back and forth from the cluster would slow development. Just to put some numbers to things, the server I'm using has 144 GB of memory and 4 GPUs meaning I can deal with square matrices of order 100 000 on the side and each operation takes about an hour (in double precision).
I have two tricks for stretching my memory further. This first is to use memory maps. A memory map is essentially a file on disk that your machine treats as memory. Now, your hard disk is a lot slower than main memory, but the beauty of a memory map is that the parts of the file that you are currently using get cached in memory, and so access and assignment are fast. This works great if you have a lot of operations to perform at a time on a small part of a huge piece of data. However, this requires you to be very careful. If you access your data too sporadically, the program will grind to a halt. Still I've found that memory maps are unbelievably forgiving, and tend to perform quite a bit better than you would initially expect. The memory map functionality is available to python users through numpy, and is trivial to use.
My second trick is much more subtle, and relies on modern operating systems being virtual memory systems. In a virtual memory system, physical memory is not allocated when a process allocates (mallocs) memory. Instead, virtual memory is. Physical memory is only allocated when the virtual memory is used. So when programming, you can feel free to allocate as much memory as you want, as long as you don't use it. This is useful for me because the matrices I deal with are symmetric. The linear algebra routines I'm using (standard LAPACK routines such as a Cholesky decomposition) never touch the lower triangular part of the matrix. So, what I do is allocate a huge chunk of memory, only fill half of it, and have matrix operations performed in place. On the machine I'm using, the maximum matrix size I can work with is 180 000 on each side, which takes about 6 hours. Not bad for not having to use a cluster.
There is a slight complication in that numpy will not allow you to allocate more memory than your machine has (using the command np.empty). I had to hack together some cython code so I could malloc the memory in C and convert it to a numpy array. I've posted the code below, feel free to use it, change it, whatever.
Here is a matrix trick that I'm sure I wasn't taught in school. Let me prefix everything by saying that I'm currently working on a map making code for the 21 cm experiment I'm involved in. Map making is big business in the CMB community and I'm trying to translate what is used there to 21 cm. Map making is all about linear algebra and dealing with large matrices and one of the big challenges is inverting some of these matrices.
The binomial inverse theorem (a.k.a. the Woodbury matrix identity or the Sherman-Morrison-Woodbury formula) tells you how to invert sums of matrices for which you already know the inverses. For a general pair of matrices it isn't that useful, since applying the formula is as much work as computing an inverse the normal way. Where the theorem comes in handy is when one of the matrices is of low rank. Lets say you have an (n*n) matrix A for which you already have the inverse. You have another matrix B that is the same size but only has rank m which is a relatively small number. The inverse of their sum can be computed in O(m*n2) operations instead of the normal O(n3). This is assuming you already know how to rotate B to an (m*m) dense matrix, but this is probably how you got B in the first place (or else why would it have such a low rank?).
All fine and good, but I still need to invert A so I haven't gained anything. This brings us to where the theorem is really useful; if A of is full rank but is diagonal or block diagonal. This symmetry lets you invert and multiply by A quickly but does nothing for A+B unless you use binomial inverse theorem. This might seem like a pretty limited range of applicability but this situation has come up no less than 3 times in my map making code.
An example is instructive. Lets say you have a set of n random numbers arranged in a vector d. Additionally, these numbers are uncorrelated, that is their covariance matrix D is diagonal, and D-1 is trivially computed. Now lets say you want to add a highly correlated component c = d + sv, where s is a random number with variance S and v is some projection vector. The new covariance matrix C = D + vSvT is dense but its inverse can still be computed trivially: C-1 = D-1 - D-1v(S-1 + vTD-1v)-1vTD-1 . This example is somewhat simplistic but is vaguely representative of the kind of situation that comes up all the time in cosmology.
Now for the catch. For the most part the identity is numerically fairly stable. However a problem arises when the update term (S in the above example) is much larger than the corresponding part of the original matrix (vTDv). I will try to motivate the problem using the above example, but if your not into statistics you can skip to the bottom of the paragraph. First consider the quantities vTDv and vTD-1v. These are the variance and information for mode v before we added the extra variance S. Now when you add variance, you subtract information, that is, vTC-1v < S-1 << vTD-1v. So here we see the problem. If S is large, then the update going from D-1 to C-1 needs to subtract out and nearly cancel all the information about mode v. This is numerical disaster and tends to make your matrices cease to be positive definite. Bottom line, the identity is numerically unstable if S >> vTDv (if v isn't normalized you need a factor of the normalization squared on the right hand side). The generalization for multiple modes (v a rectangular matrix instead of a vector) is obvious. Despite a pretty lengthy search for any reference or warning about using the identity, I had to learn this the hard way (weeks of debugging).