File IO#
Next to screen IO, reading from and writing to files is the most basic operation in data processing. Almost all data science projects start with reading data from one or more files. In this chapter we discuss basic file access. Later on we will meet several specialized modules and functions that make things more straightforward. But from time to time, for uncommon file formats, one has to resort to the most basic operations.
Basics#
Reading data from a file or writing data to a file requires three steps:
1. Open the file
Tell the operating system that file access is required. The operating system checks permissions and, if everything is okay, returns a file identifier (usually a number), which has to be used for all subsequent file operations.
2. Read or write data
Tell the operating system to move data between the file and some place in memory which can be accessed by the Python interpreter.
3. Close the file
Tell the operating system that file access is no longer required. The operating system then knows that other applications may now read from or write to the file.
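The three steps can be made visible with the low-level functions of Python's os module, which hand out the operating system's file identifier directly. This is only a minimal sketch: the temporary file demo.txt and all variable names are made up for illustration, and everyday code normally uses file objects instead.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The name 'demo.txt' is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(demo_path, 'w')
f.write('Some text')
f.close()

# Step 1: open the file; the OS returns an integer file identifier.
fd = os.open(demo_path, os.O_RDONLY)
print(type(fd))

# Step 2: move up to 100 bytes from the file into memory.
raw = os.read(fd, 100)
print(raw)

# Step 3: close the file so the OS can release the identifier.
os.close(fd)
```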
In Python all file-related data and operations are encapsulated in a file object. There are different types of file objects depending on the file type (text or binary) and on some technical details. All types of file objects provide the same methods for reading and writing. Here is the basic procedure:
f = open('testdir/testfile.txt', 'r')
file_content = f.read()
f.close()
print(file_content)
Some text
in some file
splitted over
multiple lines.
This code snippet opens a file for reading (argument 'r') and assigns the name f to the resulting file object. Then the whole content of the file is stored in the string object file_content. Finally, the file is closed and its content is printed to screen.
If something goes wrong, for instance if the file does not exist, the Python interpreter stops execution with an error message. For the moment, we do not do any error checking when working with files (this is very bad practice!).
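For orientation, here is one possible way such error checking could look. It is a sketch, not the approach used in the rest of this chapter. The file name no_such_file.txt is hypothetical and chosen so that opening fails.

```python
import os
import tempfile

# A path that does not exist: a hypothetical file name inside a fresh,
# empty temporary directory.
missing_path = os.path.join(tempfile.mkdtemp(), 'no_such_file.txt')

try:
    f = open(missing_path, 'r')
except FileNotFoundError as err:
    # Opening failed, so there is no file object that needs closing.
    file_content = None
    print('could not open file:', err.filename)
else:
    try:
        file_content = f.read()
    finally:
        # Close the file even if reading raises an error.
        f.close()
```

With this structure the program continues after a failed open instead of stopping with an error message.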
Note
The read method and all other methods for reading and writing files can be used to process text data and binary data. Providing the 'r' argument to open tells Python to open the file as a text file. Reading data from the file results in a string object.
If 'rb' is used instead, then the file is handled as a binary file and reading results in a bytes object (an immutable sequence of bytes).
Details will be discussed in the chapter on Text Files.
The default mode is 'r'. So specifying no mode opens the file for reading in text mode.
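The difference between the two modes can be checked with a short experiment. The sketch below writes a tiny file (the name demo.txt is made up for illustration) and reads it back once in text mode and once in binary mode.

```python
import os
import tempfile

# Hypothetical throwaway file containing three ASCII characters.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(demo_path, 'w')
f.write('abc')
f.close()

# Text mode: reading yields a string object.
f = open(demo_path, 'r')
text = f.read()
f.close()

# Binary mode: reading yields a bytes object.
f = open(demo_path, 'rb')
data = f.read()
f.close()

print(type(text), text)
print(type(data), data)
print(data[0])  # indexing a bytes object yields an integer, here 97 for 'a'
```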
Important methods for reading and writing files are read, readline, readlines, write, writelines, seek. See methods of file objects in the Python documentation.
For more details on access modes, see the documentation of open.
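A brief sketch of how some of these methods work together. The file name demo.txt and the line contents are made up for illustration.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical file

# Writing: write takes one string, writelines takes a list of strings.
# Neither method adds line breaks, so we include '\n' ourselves.
f = open(demo_path, 'w')
f.write('first line\n')
f.writelines(['second line\n', 'third line\n'])
f.close()

# Reading: readline returns one line, readlines returns all remaining lines.
f = open(demo_path, 'r')
first = f.readline()
rest = f.readlines()

# seek(0) moves the file position back to the beginning of the file.
f.seek(0)
again = f.readline()
f.close()

print(repr(first))
print(rest)
print(repr(again))
```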
Paths are OS Dependent#
Paths to a file are operating system dependent. Thus, using paths in the open function makes our code operating system dependent. This should be avoided and luckily there are techniques to avoid such OS dependence.
Linux/Unix/macOS
In Linux and other Unix-like systems (macOS for instance), all files can be accessed via paths of the form ‘/directory/subdirs/file’. That is, a list of directory names separated by slashes and ending with the file name. If the path starts with a slash, then it’s an absolute path, else a relative one.
Drives can be mounted as directories anywhere in the file system’s hierarchy. Thus, there is no need for special drive-related path components.
Windows
Windows uses a different format: 'drive:\directory\subdirs\file'. Instead of slashes, backslashes are used as delimiters, and absolute paths start with an additional drive letter followed by a colon. The purpose of the drive letter is to select one of several physical (or even logical) drives.
From the programmer’s point of view, additional effort is required to make code work in both worlds.
OS Independent Paths in Python#
The Python module os.path provides the function join. This function takes directory names and a file name as arguments and returns a string containing the corresponding path with appropriate (OS dependent) delimiters. So the output of the following code snippet depends on the OS used for execution.
import os.path
test_path = os.path.join('testdir', 'testfile.txt')
print(test_path)
testdir/testfile.txt
The path separator (/ or \) used by the OS is available in os.sep.
import os
print('path separator:', os.sep)
path separator: /
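To see both path styles on a single machine, one can call the standard library modules posixpath and ntpath directly. They implement the Unix-style and Windows-style versions of the os.path functions. This is a small sketch for illustration only. In real code, os.path picks the right variant automatically.

```python
import ntpath      # Windows-style implementation of the os.path functions
import posixpath   # Unix-style implementation of the os.path functions

posix_style = posixpath.join('testdir', 'testfile.txt')
windows_style = ntpath.join('testdir', 'testfile.txt')

print(posix_style)    # testdir/testfile.txt
print(windows_style)  # testdir\testfile.txt

# Absolute paths look different in the two worlds as well.
print(posixpath.isabs('/directory/subdirs/file'))       # True
print(ntpath.isabs('C:\\directory\\subdirs\\file'))     # True
print(ntpath.isabs('directory\\subdirs\\file'))         # False
```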
Directory Listings#
Often data sets are scattered over many files, for instance one file per customer, each file containing all of that customer's transactions in an online shop. In such cases we need to get a list of all files in a specified directory. Such functionality is provided by Python’s glob module.
import glob
file_list = glob.glob('testdir/*')
for file in file_list:
    print(file)
testdir/test.zip
testdir/testwrite-windows.txt
testdir/iso8859-1.txt
testdir/umlauts.txt
testdir/testfile.txt
testdir/testwrite.txt
testdir/utf-8.txt
The glob module’s glob function takes a path containing wildcards, for instance * (arbitrary string) and ? (arbitrary single character), and returns a list of all paths matching the specified pattern.
Note
To make the above code snippet OS independent, we should write glob.glob(os.path.join('testdir', '*')).
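The following sketch combines os.path.join with the two wildcards. To keep it self-contained it creates a temporary directory with three empty files. The file names a1.txt, a2.txt and b1.csv are made up for illustration. Since glob does not guarantee any particular order, the results are sorted.

```python
import glob
import os
import tempfile

# Temporary directory with three empty, hypothetical data files.
data_dir = tempfile.mkdtemp()
for name in ['a1.txt', 'a2.txt', 'b1.csv']:
    f = open(os.path.join(data_dir, name), 'w')
    f.close()

# '*' matches an arbitrary string: all files ending with '.txt'.
txt_files = sorted(glob.glob(os.path.join(data_dir, '*.txt')))

# '?' matches exactly one character: 'a1.txt' and 'a2.txt'.
a_files = sorted(glob.glob(os.path.join(data_dir, 'a?.txt')))

# Strip the directory part to show only the file names.
txt_names = [os.path.basename(path) for path in txt_files]
print(txt_names)
```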