Persistent Transactional Memory

Instead of continuously serializing and deserializing data to disk or to a database, it is often easier to work with a shared memory that maps onto a file. The drawback of this approach is that recovering from crashes is not straightforward, since we never know exactly what state the memory was in when the program quit (was killed, or crashed). To solve that, it is better to work with a file that stores only stable memory states. Every time the program executes a 'commit' statement, the memory is either fully committed to disk or not at all (if a crash occurs during the commit).

Bug Tracker - choose the component 'TxMem'. Don't forget to check out closed bugs.
Source code documentation - generated with doxygen
Source - The package is called txmem-<version>.tgz
Contact - werner@yellowcouch.org

Technically

Technically, the transactional memory is created using a file-backed MAP_PRIVATE mapping. Such a mapping has the property that pages are loaded into memory when they are first accessed and copied when they are written to. Consequently, the underlying file is never modified directly.
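The copy-on-write behaviour is easy to check in isolation. The following is a minimal sketch, not code from the library: the function name and the temporary file pattern are made up for illustration. It writes through a MAP_PRIVATE mapping and then reads the file with pread to confirm that the write never reached it.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstdlib>

// Returns true when a write through a MAP_PRIVATE mapping is visible in
// memory but has NOT reached the backing file.
// Hypothetical helper; the temporary file name is an arbitrary choice.
bool private_write_stays_out_of_file()
{
    char path[] = "/tmp/txmem-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return false;
    unlink(path);                        // the file vanishes once fd is closed

    const long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize > 0);
    if (ftruncate(fd, pagesize) != 0) { close(fd); return false; }

    void* p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }
    char* M = static_cast<char*>(p);

    M[0] = 'a';                          // the kernel copies page 0 on this write

    char on_disk = 'x';
    bool ok = pread(fd, &on_disk, 1, 0) == 1   // read the file, bypassing the mapping
           && M[0] == 'a'                      // the mapping sees the write
           && on_disk == 0;                    // the file still holds the zero byte

    munmap(p, pagesize);
    close(fd);
    return ok;
}
```

Calling private_write_stays_out_of_file() should return true on a system where /tmp is writable. This is exactly why a separate commit step is needed to bring modified pages to the file.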

Once a commit is requested, the library runs through memory, selects all 'dirty' pages and writes them back to the file. For those who believe that a plain MAP_SHARED mapping with msync would have done the job: there is no guarantee that a modified page is not flushed to disk before msync is called, which would leave the file in an invalid state.

The library first saves all memory pages that need to be saved to a journal file, then places those pages in the real memory file, and finally removes the journal.
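The three steps can be sketched as follows. This is an illustration of the general journaling idea under stated assumptions, not the library's actual code or file format: the DirtyPage struct, the function names, the record layout (count, then offset/length/data per page, then an end marker) and the marker value are all invented for this example.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One modified page: its byte offset in the data file and its new contents.
struct DirtyPage { uint64_t offset; std::vector<char> data; };

// Arbitrary end-of-journal marker; a journal without it is considered torn.
static const uint64_t JOURNAL_COMPLETE = 0x6A6F75726E616C21ULL;

static bool write_all(int fd, const void* buf, size_t n)
{
    assert(n == 0 || buf != nullptr);
    const char* p = static_cast<const char*>(buf);
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0) return false;
        p += w;
        n -= static_cast<size_t>(w);
    }
    return true;
}

static bool read_all(int fd, void* buf, size_t n)
{
    char* p = static_cast<char*>(buf);
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return false;              // error or premature end of file
        p += r;
        n -= static_cast<size_t>(r);
    }
    return true;
}

// Step 1: save all dirty pages to the journal and force it to disk.
bool write_journal(const std::string& journal, const std::vector<DirtyPage>& pages)
{
    int fd = open(journal.c_str(), O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0) return false;
    uint64_t count = pages.size();
    bool ok = write_all(fd, &count, sizeof count);
    for (size_t i = 0; ok && i < pages.size(); i++) {
        uint64_t len = pages[i].data.size();
        ok = write_all(fd, &pages[i].offset, sizeof pages[i].offset)
          && write_all(fd, &len, sizeof len)
          && write_all(fd, pages[i].data.data(), len);
    }
    ok = ok && write_all(fd, &JOURNAL_COMPLETE, sizeof JOURNAL_COMPLETE)
            && fsync(fd) == 0;
    close(fd);
    return ok;
}

// Steps 2 and 3: place the journaled pages in the data file, then remove the
// journal. Meant to run at startup as well: a torn journal is discarded (the
// data file is still at the previous commit), a complete one is replayed.
// Returns true when a complete journal was applied.
bool apply_journal(const std::string& journal, const std::string& datafile)
{
    int jfd = open(journal.c_str(), O_RDONLY);
    if (jfd < 0) return false;                 // no journal, nothing to do
    std::vector<DirtyPage> pages;
    uint64_t count = 0, marker = 0;
    bool complete = read_all(jfd, &count, sizeof count);
    for (uint64_t i = 0; complete && i < count; i++) {
        DirtyPage p;
        uint64_t len = 0;
        complete = read_all(jfd, &p.offset, sizeof p.offset)
                && read_all(jfd, &len, sizeof len);
        if (complete) {
            p.data.resize(len);
            complete = read_all(jfd, p.data.data(), len);
            pages.push_back(p);
        }
    }
    complete = complete && read_all(jfd, &marker, sizeof marker)
                        && marker == JOURNAL_COMPLETE;
    close(jfd);
    if (!complete) { unlink(journal.c_str()); return false; }

    int dfd = open(datafile.c_str(), O_CREAT | O_WRONLY, 0600);
    bool ok = dfd >= 0;
    for (size_t i = 0; ok && i < pages.size(); i++) {
        const DirtyPage& p = pages[i];
        ok = pwrite(dfd, p.data.data(), p.data.size(), (off_t)p.offset)
             == (ssize_t)p.data.size();
    }
    ok = ok && fsync(dfd) == 0;
    if (dfd >= 0) close(dfd);
    if (ok) unlink(journal.c_str());           // the commit is now final
    return ok;
}
```

In this sketch a commit is write_journal followed by apply_journal, and a restart calls apply_journal once before mapping the file. A crash during step 1 leaves a torn journal and an untouched data file. A crash during step 2 leaves a complete journal that can simply be replayed, because rewriting the same pages twice is harmless.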

Features

  • Atomic delivery of all modified memory to the underlying file
  • The library has been tested with random failures during the commit and journal restore phase.
  • Every time the memory is loaded, it will be at one of the 'commit' points
  • If the program crashed before a first commit, the file will be empty

Warnings

  • Does not mind multithreading, as long as the 'commit' request happens at a stable state, so all your threads should be synchronized at that point.
  • Is vulnerable to forks, since the mmap will be carried from one fork to the next.
  • Since you work with shared memory, it might be necessary to use 'smarter' pointers that rely on offsets from one object to another instead of hard addresses. Offset pointers offer such a possibility.
  • If the memory should be exchangeable from one machine to another, make sure your integers are byte-order consistent.
  • Requires at least kernel 2.6.31 
  • Requires access to the /proc/kpageflags file; either this file is made readable for everybody (chmod 444 /proc/kpageflags), or the program you write runs as root.
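To make the remark about offset-based pointers concrete, here is a minimal sketch of a self-relative pointer. The class name OffsetPtr and the example Node type are hypothetical and not part of the library. The idea is that the pointer stores the distance from its own address to the target, so the link stays valid when the whole file is mapped at a different base address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A self-relative pointer (hypothetical name). An offset of 0 encodes null,
// so such a pointer cannot point at its own address.
template <typename T>
class OffsetPtr
{
    std::ptrdiff_t offset;   // bytes from this object to the target, 0 = null

    void set(T* target)
    {
        assert(static_cast<const void*>(target) != static_cast<const void*>(this));
        offset = target
            ? reinterpret_cast<std::intptr_t>(target)
              - reinterpret_cast<std::intptr_t>(this)
            : 0;
    }

public:
    OffsetPtr() : offset(0) {}
    // Copies must recompute the offset, because it is relative to 'this'.
    OffsetPtr(const OffsetPtr& other) { set(other.get()); }
    OffsetPtr& operator=(const OffsetPtr& other) { set(other.get()); return *this; }
    OffsetPtr& operator=(T* target) { set(target); return *this; }

    T* get() const
    {
        if (offset == 0) return nullptr;
        return reinterpret_cast<T*>(reinterpret_cast<std::intptr_t>(this) + offset);
    }
    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }
};

// Example of a type that could live inside the mapped region.
struct Node
{
    int value;
    OffsetPtr<Node> next;
};
```

With two Node objects placed in the mapped memory, first.next = &second; stores only the distance between them. If the file is later mapped elsewhere, first.next.get() still yields the relocated second node, whereas a raw Node* would point at the old address.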

Demo Use

int main(int argc, char* argv[])
{
  TxMemory Memory("data",10000);
  char* M=Memory.M;
  M[0]='a';
  /* crashpoint 1 */
  M[4096*2]='b';
  commit();
  M[4096]='t';
  /* crashpoint 2 */
  commit();
}
After a restart, the memory of this program will be in one of three states:
  1. the memory is uninitialized
  2. M[0]=='a' and M[8192]=='b'
  3. M[0]=='a' and M[8192]=='b' and M[4096]=='t'
If the program crashes at crash point 1, it will automatically be back in the previous state (uninitialized) when it restarts.
If the program crashes at crash point 2, it will automatically be in the state where addresses 0 and 8192 have been written, but not address 4096.

Implementation Details

One of the things that was remarkably hard to figure out was how the Linux kernel tells us which pages have been written to. After a question on Stack Overflow, it became apparent that this is indeed not that straightforward.

Following the advice of MarkR, I learned that one has to go through /proc/self/pagemap and /proc/kpageflags. Then, of course, it became interesting, since one must figure out how the kernel flags written-to pages. Interestingly, this has nothing to do with the protection level directly: for the entire MAP_PRIVATE mapping, the protection level is 'writable'. Instead, a written page is flagged as being backed by swap space or by memory (rather than by the file), which is the SWAPBACKED bit. I also learned that not all pages need to have the same page size, which would be a very entertaining problem if it ever came to bite this library. For now, we assume that all pages in the file have the same size.
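The bit fiddling involved can be isolated in a few small helpers. This is a sketch with hypothetical names, not part of the library. The layout is the one given in the kernel's pagemap documentation (bit 63 marks a present page, bits 0-54 hold the page frame number of a present page), and bit 14 for SWAPBACKED is the bit tested in the program that follows.

```cpp
#include <cassert>
#include <cstdint>

// Layout of one 64-bit /proc/self/pagemap entry (names are made up here).
const uint64_t PAGEMAP_PRESENT  = 1ULL << 63;         // page is present in RAM
const uint64_t PAGEMAP_PFN_MASK = (1ULL << 55) - 1;   // bits 0-54: page frame number

// Bit number of the SWAPBACKED flag in a /proc/kpageflags entry.
const unsigned SWAPBACKED_BIT = 14;

// Both files hold one 8-byte entry per page, indexed by page number.
const uint64_t ENTRY_SIZE = 8;

// Byte offset in /proc/self/pagemap of the entry for a virtual address.
uint64_t pagemap_offset(uint64_t virtual_address, uint64_t pagesize)
{
    assert(pagesize > 0);
    return (virtual_address / pagesize) * ENTRY_SIZE;
}

// True when the pagemap entry describes a page that is present in RAM.
bool page_present(uint64_t pagemap_entry)
{
    return (pagemap_entry & PAGEMAP_PRESENT) != 0;
}

// Page frame number of a present page; only meaningful if page_present().
uint64_t page_frame_number(uint64_t pagemap_entry)
{
    return pagemap_entry & PAGEMAP_PFN_MASK;
}

// Byte offset in /proc/kpageflags of the flags for a page frame.
uint64_t kpageflags_offset(uint64_t pfn)
{
    return pfn * ENTRY_SIZE;
}

// True when the page is backed by swap/RAM instead of by the mapped file,
// which is how a written-to page of a MAP_PRIVATE mapping shows up.
bool is_swap_backed(uint64_t kpageflags_entry)
{
    return ((kpageflags_entry >> SWAPBACKED_BIT) & 1) != 0;
}
```

For a mapping that starts at address M, the entry of page i sits at pagemap_offset(M + i*pagesize, pagesize). If that entry is present, its flags are read from /proc/kpageflags at kpageflags_offset(page_frame_number(entry)).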

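#define _LARGEFILE64_SOURCE /* for lseek64 */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
  unsigned long long pagesize=getpagesize();
  assert(pagesize>0);
  int pagecount=4;
  int filesize=pagesize*pagecount;
  int fd=open("test.dat", O_RDWR);
  if (fd<0) // open returns -1 on failure; 0 is a valid descriptor
  {
    fd=open("test.dat", O_CREAT|O_RDWR,S_IRUSR|S_IWUSR);
    printf("Created test.dat testfile\n");
  }
  assert(fd>=0);
  int err=ftruncate(fd,filesize);
  assert(!err);

  char* M=(char*)mmap(NULL, filesize, PROT_READ|PROT_WRITE, MAP_PRIVATE,fd,0);
  assert(M!=(char*)MAP_FAILED); // mmap reports failure with MAP_FAILED, not NULL
  printf("Successfully created private mapping\n");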

The test setup contains 4 pages. Pages 0 and 2 are dirty:

  strcpy(M,"I feel so dirty\n");
  strcpy(M+pagesize*2,"Christ on crutches\n");

Page 3 has been read from:

  char t=M[pagesize*3];

Page 1 will not be accessed.

The pagemap file maps the virtual pages of the process to physical page frames, which can then be looked up in the global kpageflags file later on. See /usr/src/linux/Documentation/vm/pagemap.txt for details.

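  int mapfd=open("/proc/self/pagemap",O_RDONLY);
  assert(mapfd>=0);
  assert(sizeof(long long)==8);
  // one 8-byte entry per virtual page, indexed by virtual page number
  unsigned long long target=((unsigned long)(void*)M)/pagesize;
  // the offset does not fit in an int, so keep it in a 64-bit variable
  off64_t pos=lseek64(mapfd, target*8, SEEK_SET);
  assert(pos==(off64_t)(target*8));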

Here we read the page frame numbers for each of our virtual pages

  unsigned long long page2pfn[pagecount];
  err=read(mapfd,page2pfn,sizeof(long long)*pagecount);
  if (err<0)
    perror("Reading pagemap");
  if(err!=pagecount*8)
    printf("Could only read %d bytes\n",err);

Now we read, for each virtual page, the actual page flags

  int pageflags=open("/proc/kpageflags",O_RDONLY);
  assert(pageflags>=0);
  for(int i = 0 ; i < pagecount; i++)
  {
    unsigned long long v2a=page2pfn[i];
    printf("Page: %d, flag %llx\n",i,page2pfn[i]);

    if(v2a&0x8000000000000000ULL) // Is the virtual page present ?
    {
      // bits 0-54 of a present entry hold the page frame number
      unsigned long long pfn=v2a&0x7fffffffffffffULL;
      off64_t pos=lseek64(pageflags,pfn*8,SEEK_SET);
      assert(pos==(off64_t)(pfn*8));
      unsigned long long pf;
      err=read(pageflags,&pf,8);
      assert(err==8);
      printf("pageflags are %llx with SWAPBACKED: %d\n",pf,(int)((pf>>14)&1));
    }
  }
  while(true) sleep(1);
}

Bootnote on the kernel interface

All in all, I'm not particularly happy with this convoluted approach to figuring out whether a page has been written to or not! How about a simple kernel call to retrieve the appropriate page flags?