What is a Memory-Mapped File in Python
From Python’s official documentation, be sure to checkout Python’s mmap module:
A memory-mapped file object behaves like both strings and like file objects. Unlike normal string objects, however, these are mutable.
Basically, a memory-mapped (using Python’s mmap
module) file object maps a normal file object into memory. This allows you to modify a file object’s content directly in memory. Since a memory-mapped file object also behaves like a mutable string object, you can modify the content of a file object like you modify the content of a list of characters:
obj[1] = 'a'
– Assigns a character ‘a’ to the second character of the file object’s content.obj[1:4] = 'abc'
– Assigns a character list ‘abc’ to a range of three characters starting at the second character of the file object’s content.
In a nutshell, memory-mapping a file with Python’s mmap
module us use the operating system’s virtual memory to access the data on the filesystem directly. Instead of making system calls such as open, read and lseek to manipulate a file, memory-mapping puts the data of the file into memory which allows you to directly manipulate files in memory. This greatly improves I/O performance.
- Python Programming – Scope and Module
- Python Programming – File Functions
- Python Programming – File Processing
Comparison of Memory-Mapped Files vs Normal Files with Python
Assume we have a binary file test.out that is larger than 10 MB and there’s a certain kind of algorithm that requires us to process the data of the file in such a manner that needs us to repeat the process of:
- From the current position, seek 64 bytes and process the data at the beginning of current position.
- From the current position, seek -32 bytes and process the data at the beginning of the current position.
The actual process of the data is replaced by a pass
statement since it does not affect the relative performance comparison between mmap
and normal file access. The algorithm keeps processing the data of the file until it reaches a point that is larger than 10MB. The code that uses the normal file objects to perform the algorithm is listed in the file normal_process.py:
import os import time f = open('test.out', 'r') buffer_size = 64 retract_size = -32 start_time = time.time() while True: f.seek(buffer_size, os.SEEK_CUR) # Process some data starting at the current position pass f.seek(retract_size, os.SEEK_CUR) # Process some data starting at the current position pass if f.tell() > 1024 * 1024 * 10: break end_time = time.time() f.close() print('Normal time elapsed: {0}'.format(end_time - start_time))
The code that uses mmap
to process the data of the file is listed in the file mmap_process.py:
import os import time import mmap f = open('test.out', 'r') m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) buffer_size = 64 retract_size = -32 start_time = time.time() while True: m.seek(buffer_size, os.SEEK_CUR) # process some data starting at the current position pass m.seek(retract_size, os.SEEK_CUR) # process some data starting at the current position pass if m.tell() > 1024 * 1024 * 10: break end_time = time.time() m.close() f.close() print('mmap time elapsed: {0}'.format(end_time - start_time))
Now you can compare normal_process.py and mmap_process.py in a simple for-loop in your shell:
for i in {1..3} do python normal_process.py python mmap_process.py done normal time elapsed: 0.355199098587 mmap time elapsed: 0.296804904938 normal time elapsed: 0.371860027313 mmap time elapsed: 0.290856838226 normal time elapsed: 0.355377197266 mmap time elapsed: 0.305727958679
As you can see, mmap_process.py is about 17% faster than normal_process.py on average, because the seek
function calls are performed directly against the memory in mmap_process.py while they are performed using filesystem calls in normal_process.py.
Modify a Memory-Mapped File in Python with mmap
In the following file mmap_write.py, we modify the content of a file write_test.txt using mmap
and flush the changes back to disk:
import os import mmap f = open('write_test.txt', 'a+b') m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) m[0] = 'n' # Flush changes made to the in-memory copy of the file back to disk m.flush() m.close() f.close()
Suppose write_test.txt contains one line of text:
$ cat write_test.txt mmap is a cool feature!
After we run python mmap_write.py, its content will change to (note the first character of the sentence):
$ cat write_test.txt nmap is a cool feature!
mmap Summary and Suggestions
Although mmap
is a cool feature, keep in mind that mmap
has to find a contiguous block of addresses in your process’s address space that is large enough to fit an entire file object. Suppose you’re working with large files on a system that does not have enough continuous memory region available to fit those files, then creating a mmap
will fail. In additions, mmap
does not work on certain special file objects like pipes and ttys.
The following list summarizes when you should use mmap
:
- During multithreaded programming, if you have multiple processes accessing data in a read only way from the same file, then using
mmap
can save you lots of memory. mmap
allows the operating system to optimize paging operations, which enables the program’s memory in pages to be efficiently re-used by the operating system.