Memory-Mapped (mmap) File Support in Python

What is a Memory-Mapped File in Python

Python’s official documentation for the mmap module describes a memory-mapped file as follows:

A memory-mapped file object behaves like both strings and like file objects. Unlike normal string objects, however, these are mutable.

Basically, a memory-mapped file object (created with Python’s mmap module) maps a normal file into memory, which lets you modify the file’s content directly in memory. Because a memory-mapped file object also behaves like a mutable string, you can modify the file’s content the same way you would modify a list of characters:

  • obj[1] = 'a' – Assigns the character ‘a’ to the second character of the file’s content.
  • obj[1:4] = 'abc' – Assigns the characters ‘abc’ to the three characters starting at the second character of the file’s content.

In Python 3 the mapping behaves like a bytearray rather than a string, so these assignments become obj[1] = ord('a') and obj[1:4] = b'abc'.
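
The two assignments can be tried on a throwaway file. What follows is a minimal sketch that assumes Python 3; the temporary file and the helper name demo_assignment are made up for the illustration.

```python
import mmap
import os
import tempfile


def demo_assignment():
    """Edit a small scratch file through a memory map and return its new content."""
    # Hypothetical scratch file, created only for this illustration.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"0123456789")
            f.flush()  # the file must be non-empty before it can be mapped
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
                m[1] = ord("a")  # single index: an integer byte value in Python 3
                m[1:4] = b"abc"  # slice: the replacement must have the same length
                m.flush()
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


result = demo_assignment()
print(result)  # b'0abc456789'
```

Note that a slice assignment cannot grow or shrink the mapping, so the right-hand side has to be exactly as long as the slice it replaces.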

In a nutshell, memory-mapping a file with Python’s mmap module uses the operating system’s virtual memory to access the data on the filesystem directly. Instead of making system calls such as open, read and lseek to manipulate a file, memory-mapping places the file’s data in the process’s address space, which lets you manipulate the file directly in memory. This can improve I/O performance significantly.
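
To make the "string-like and file-like" behaviour concrete, here is a small read-only sketch, assuming Python 3. The scratch file contents and the function name demo_read_access are invented for the example; slicing, find, seek and read are methods of the mmap object itself.

```python
import mmap
import os
import tempfile


def demo_read_access():
    """Read from a scratch file through a read-only memory map."""
    fd, path = tempfile.mkstemp()  # hypothetical scratch file
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"header:payload:trailer")
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                first = m[:6]             # bytes-like: slice without calling read()
                pos = m.find(b"payload")  # search the mapped bytes
                m.seek(pos)               # file-like: move the map's own position
                chunk = m.read(7)         # read from that position
                return first, pos, chunk
    finally:
        os.remove(path)


first, pos, chunk = demo_read_access()
print(first, pos, chunk)  # b'header' 7 b'payload'
```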

Comparison of Memory-Mapped Files vs Normal Files with Python

Assume we have a binary file test.out that is larger than 10 MB, and an algorithm that processes the file’s data by repeating the following two steps:

  • Seek 64 bytes forward from the current position and process the data starting at the new position.
  • Seek 32 bytes backward from the current position and process the data starting at the new position.

The actual processing of the data is replaced by a pass statement, since it does not affect the relative performance of mmap and normal file access. The algorithm keeps going until the file position passes the 10 MB mark. The code that uses a normal file object is listed in the file normal_process.py:

import os
import time
 
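# Binary mode: test.out is a binary file, and text-mode files in
# Python 3 reject non-zero seeks relative to the current position.
f = open('test.out', 'rb')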
buffer_size = 64
retract_size = -32
start_time = time.time()
while True:
    f.seek(buffer_size, os.SEEK_CUR)
    # Process some data starting at the current position
    pass
    f.seek(retract_size, os.SEEK_CUR)
    # Process some data starting at the current position
    pass
    if f.tell() > 1024 * 1024 * 10:
        break
end_time = time.time()
f.close()
print('normal time elapsed: {0}'.format(end_time - start_time))

The code that uses mmap to process the data of the file is listed in the file mmap_process.py:

import os
import time
import mmap
 
f = open('test.out', 'rb')  # binary mode, matching the binary file being mapped
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
buffer_size = 64
retract_size = -32
start_time = time.time()
while True:
    m.seek(buffer_size, os.SEEK_CUR)
    # process some data starting at the current position
    pass
    m.seek(retract_size, os.SEEK_CUR)
    # process some data starting at the current position
    pass
    if m.tell() > 1024 * 1024 * 10:
        break
end_time = time.time()
m.close()
f.close()
print('mmap time elapsed: {0}'.format(end_time - start_time))

Now you can compare normal_process.py and mmap_process.py in a simple for-loop in your shell:

for i in {1..3}
do
    python normal_process.py
    python mmap_process.py
done

Sample output:

normal time elapsed: 0.355199098587
mmap time elapsed: 0.296804904938
normal time elapsed: 0.371860027313
mmap time elapsed: 0.290856838226
normal time elapsed: 0.355377197266
mmap time elapsed: 0.305727958679

As you can see, mmap_process.py takes about 17% less time than normal_process.py on average in these runs, because its seek calls are performed directly against the mapped memory, while those in normal_process.py are performed using filesystem calls.
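
For readers who want to repeat the measurement in a single script, here is a rough sketch that generates its own scratch file and times both access patterns. It assumes Python 3; the function names, the default sizes and the use of time.perf_counter are choices made for this illustration and are not part of the original scripts. The numbers will vary by machine and operating system.

```python
import mmap
import os
import tempfile
import time

STEP_FORWARD = 64  # same offsets as the two scripts above
STEP_BACK = -32


def seek_loop(obj, limit):
    """Repeat the seek-forward / seek-back pattern until the position passes limit."""
    while obj.tell() <= limit:
        obj.seek(STEP_FORWARD, os.SEEK_CUR)
        obj.seek(STEP_BACK, os.SEEK_CUR)


def compare(limit=10 * 1024 * 1024):
    """Return (normal_seconds, mmap_seconds) for one run over a scratch file."""
    fd, path = tempfile.mkstemp()  # hypothetical stand-in for test.out
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * (limit + 1024))  # slightly larger than the limit
        with open(path, "rb") as f:
            start = time.perf_counter()
            seek_loop(f, limit)
            normal = time.perf_counter() - start
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                start = time.perf_counter()
                seek_loop(m, limit)
                mapped = time.perf_counter() - start
        return normal, mapped
    finally:
        os.remove(path)


# A 1 MB limit keeps the demonstration quick; call compare() for the 10 MB case.
normal_s, mmap_s = compare(limit=1024 * 1024)
print('normal time elapsed: {0}'.format(normal_s))
print('mmap time elapsed: {0}'.format(mmap_s))
```

The scratch file is made a little larger than the limit because, unlike a normal file object, an mmap object raises ValueError when asked to seek past the end of the mapping.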

Modify a Memory-Mapped File in Python with mmap

In the following file mmap_write.py, we modify the content of a file write_test.txt using mmap and flush the changes back to disk:

import os
import mmap
 
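# Open for reading and writing without truncating; the file must
# already exist and must not be empty, or it cannot be mapped.
f = open('write_test.txt', 'r+b')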
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
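# Slice assignment with a bytes literal works in both Python 2 and 3
# (in Python 3, m[0] = 'n' raises a TypeError).
m[0:1] = b'n'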
# Flush changes made to the in-memory copy of the file back to disk
m.flush()
m.close()
f.close()

Suppose write_test.txt contains one line of text:

$ cat write_test.txt
mmap is a cool feature!

After we run python mmap_write.py, its content will change to (note the first character of the sentence):

$ cat write_test.txt
nmap is a cool feature!
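
Whether an edit reaches the file depends on the access mode. The sketch below, assuming Python 3, contrasts mmap.ACCESS_WRITE (changes are written through to the file) with mmap.ACCESS_COPY (changes stay in a private in-memory copy). The helper name edit_first_byte and the scratch file are assumptions made for the illustration.

```python
import mmap
import os
import tempfile


def edit_first_byte(path, access):
    """Set the first byte to b'n' through a map opened with the given access mode.

    Returns (first 4 bytes seen through the map, first 4 bytes read from the file).
    """
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[0:1] = b"n"
            if access == mmap.ACCESS_WRITE:
                m.flush()  # write-through map: push the change to the file
            in_memory = m[:4]
    with open(path, "rb") as f:
        return in_memory, f.read(4)


fd, tmp_path = tempfile.mkstemp()  # hypothetical stand-in for write_test.txt
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"mmap is a cool feature!\n")
    copy_result = edit_first_byte(tmp_path, mmap.ACCESS_COPY)
    write_result = edit_first_byte(tmp_path, mmap.ACCESS_WRITE)
finally:
    os.remove(tmp_path)

print(copy_result)   # (b'nmap', b'mmap'): the file is untouched
print(write_result)  # (b'nmap', b'nmap'): the file was modified
```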

mmap Summary and Suggestions

Although mmap is a cool feature, keep in mind that mmap has to find a contiguous block of addresses in your process’s address space that is large enough to fit the entire file. If you are working with large files on a system that does not have a large enough contiguous region of address space available, creating the mmap will fail. In addition, mmap does not work on certain special file objects such as pipes and ttys.
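
One way to work around the address-space limit is to map only part of a file. mmap.mmap accepts length and offset arguments, and the offset has to be a multiple of mmap.ALLOCATIONGRANULARITY. The sketch below, written for Python 3, reads a byte range by mapping just the aligned window that contains it; the helper name read_window and the scratch file are assumptions made for the illustration.

```python
import mmap
import os
import tempfile


def read_window(path, start, size):
    """Return size bytes beginning at start, mapping only the window that holds them.

    The caller is expected to keep start + size within the file.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (start // gran) * gran  # offset must be a multiple of the granularity
    delta = start - aligned           # where the requested bytes begin inside the map
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), delta + size,
                       access=mmap.ACCESS_READ, offset=aligned) as m:
            return m[delta:delta + size]


fd, tmp_path = tempfile.mkstemp()  # hypothetical scratch file
try:
    with os.fdopen(fd, "wb") as f:
        padding = b"\0" * (mmap.ALLOCATIONGRANULARITY + 10)
        f.write(padding + b"hello" + b"\0" * 100)
    window = read_window(tmp_path, mmap.ALLOCATIONGRANULARITY + 10, 5)
finally:
    os.remove(tmp_path)

print(window)  # b'hello'
```

Only delta + size bytes of address space are used for the mapping, regardless of how large the file is.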

The following list summarizes when you should use mmap:

  • In multiprocess programs, if several processes access data from the same file in a read-only way, using mmap can save a lot of memory, because the operating system can share the mapped pages between them.
  • mmap allows the operating system to optimize paging operations, so the pages backing the program’s memory can be reused efficiently by the operating system.
