File Input/Output
In this tutorial, you'll learn about Python file operations: opening a file, reading from it, writing
to it, closing it, and the file methods that you should be aware of.
Files
Files are named locations on disk to store related information. They are used to permanently store
data in a non-volatile memory (e.g. hard disk).
Since Random Access Memory (RAM) is volatile (it loses its data when the computer is turned
off), we use files to store data permanently for future use.
When we want to read from or write to a file, we need to open it first. When we are done, it needs to
be closed so that the resources tied to the file are freed. Hence, a file operation takes place in
the following order:
1. Open a file
2. Read or write (perform operation)
3. Close the file
Python has a built-in open() function to open a file. This function returns a file object, also called
a handle, as it is used to read or modify the file accordingly.
We can specify the mode while opening a file. In mode, we specify whether we want to read r ,
write w or append a to the file. We can also specify if we want to open the file in text mode or
binary mode.
The default is reading in text mode. In this mode, we get strings when reading from the file.
On the other hand, binary mode returns bytes and this is the mode to be used when dealing with
non-text files like images or executable files.
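As a rough sketch of the difference between text and binary mode, the snippet below writes a small file and reads it back both ways. The file name demo.txt and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched (assumption for this demo).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    f = open(path, "w", encoding="utf-8")  # write mode, text
    f.write("hello")
    f.close()

    f = open(path, "r", encoding="utf-8")  # read mode, text (same as "rt")
    text_data = f.read()                   # a str
    f.close()

    f = open(path, "rb")                   # read mode, binary
    binary_data = f.read()                 # a bytes object
    f.close()

print(type(text_data).__name__, text_data)
print(type(binary_data).__name__, binary_data)
```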
Mode  Description
w     Opens a file for writing. Creates a new file if it does not exist or truncates the file if it exists.
x     Opens a file for exclusive creation. If the file already exists, the operation fails.
a     Opens a file for appending at the end of the file without truncating it. Creates a new file if it does not exist.
Unlike other languages, the character a does not imply the number 97 until it is encoded using
ASCII (or other equivalent encodings).
Moreover, the default encoding is platform dependent. On Windows, it is typically cp1252, while on
Linux it is typically utf-8.
So, we must not rely on the default encoding, or else our code will behave differently on different
platforms.
Hence, when working with files in text mode, it is highly recommended to specify the encoding type.
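A minimal sketch of why the encoding matters: the same text is written with an explicit utf-8 encoding and then read back both as text and as raw bytes. The file name and the sample word are assumptions chosen for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "greeting.txt")  # hypothetical file name

    f = open(path, "w", encoding="utf-8")  # encoding stated explicitly
    f.write("café")
    f.close()

    f = open(path, "r", encoding="utf-8")  # decode with the same encoding
    decoded = f.read()
    f.close()

    f = open(path, "rb")                   # raw bytes as stored on disk
    raw = f.read()
    f.close()

print(decoded)
print(raw)  # the single character é is stored as two bytes in utf-8
```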
When we are done with performing operations on the file, we need to properly close the file.
Closing a file will free up the resources that were tied with the file. It is done using the close()
method available in Python.
Python has a garbage collector to clean up unreferenced objects but we must not rely on it to close
the file.
Calling close() on its own is not entirely safe. If an exception occurs while we are performing some
operation on the file, the code exits without closing the file. A safer way is to use a try...finally
block.
# open before the try block, so f always exists when finally runs
f = open("test.txt", encoding='utf-8')
try:
    # perform file operations
    pass
finally:
    f.close()
This way, we are guaranteeing that the file is properly closed even if an exception is raised that
causes program flow to stop.
The best way to close a file is by using the with statement. This ensures that the file is closed
when the block inside the with statement is exited.
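The following sketch shows the with statement closing the file automatically. The temporary directory is an assumption used so the example does not touch real files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    with open(path, "w", encoding="utf-8") as f:
        f.write("inside the with block\n")
        closed_inside = f.closed  # still open here

    closed_after = f.closed       # closed as soon as the block is exited

print(closed_inside)
print(closed_after)
```

No explicit call to close() is needed; the file is closed even if an exception is raised inside the block.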
In order to write into a file in Python, we need to open it in write w , append a or exclusive creation
x mode.
We need to be careful with the w mode, as it will overwrite the file if it already exists. As a
result, all the previous data is erased.
Writing a string or sequence of bytes (for binary files) is done using the write() method. This
method returns the number of characters written to the file.
This program will create a new file named test.txt in the current directory if it does not exist. If it
does exist, it is overwritten.
We must include the newline characters ourselves to distinguish the different lines.
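The writing steps described above can be sketched as follows. To keep the example self-contained, it writes test.txt into a temporary directory rather than the current directory; that choice, and the three sample lines, are assumptions.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    # "w" creates the file, or truncates it if it already exists
    with open(path, "w", encoding="utf-8") as f:
        first_count = f.write("This is my first file\n")  # write() returns the character count
        f.write("This file\n")
        f.write("contains three lines\n")

    # read everything back to check what was written
    with open(path, encoding="utf-8") as f:
        contents = f.read()

print(first_count)
print(contents, end="")
```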
To read a file in Python, we must open it in reading r mode. There are various methods available
for reading. We can use the read(size) method to read in size characters (or bytes, in binary
mode). If the size parameter is not specified, it reads and returns up to the end of the file.
We can read the test.txt file we wrote in the above section in the following way:
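A minimal sketch of read(size), assuming a test.txt with the three sample lines used elsewhere in this tutorial. The file is created in a temporary directory first so the example runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("This is my first file\nThis file\ncontains three lines\n")

    with open(path, "r", encoding="utf-8") as f:
        first_four = f.read(4)  # read the first 4 characters
        next_four = f.read(4)   # read the next 4 characters
        rest = f.read()         # read everything up to the end of the file
        at_eof = f.read()       # further reads return an empty string

print(repr(first_four))
print(repr(next_four))
print(repr(rest))
print(repr(at_eof))
```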
We can change our current file cursor (position) using the seek() method. Similarly, the tell()
method returns our current position (in number of bytes).
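As a short sketch of tell() and seek(), again assuming the same sample test.txt created in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("This is my first file\nThis file\ncontains three lines\n")

    with open(path, "r", encoding="utf-8") as f:
        f.read(4)
        position_after_read = f.tell()  # cursor has moved past the first 4 characters
        f.seek(0)                       # bring the cursor back to the start
        position_after_seek = f.tell()
        first_line_again = f.readline()

print(position_after_read)
print(position_after_seek)
print(repr(first_line_again))
```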
We can read a file line-by-line using a for loop. This is both efficient and fast.
In this program, the lines in the file itself include a newline character \n . So, we use the end
parameter of the print() function to avoid two newlines when printing.
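A sketch of the line-by-line loop described above, assuming the same sample test.txt (created here in a temporary directory so the example is self-contained):

```python
import os
import tempfile

lines_seen = []

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("This is my first file\nThis file\ncontains three lines\n")

    with open(path, "r", encoding="utf-8") as f:
        for line in f:            # the file object yields one line at a time
            print(line, end="")   # each line already ends with "\n"
            lines_seen.append(line)
```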
Alternatively, we can use the readline() method to read individual lines of a file. This method
reads a file till the newline, including the newline character.
>>> f.readline()
'This is my first file\n'

>>> f.readline()
'This file\n'

>>> f.readline()
'contains three lines\n'

>>> f.readline()
''
Lastly, the readlines() method returns a list of the remaining lines of the file. All these reading
methods return empty values when the end of file (EOF) is reached.
>>> f.readlines()
['This is my first file\n', 'This file\n', 'contains three lines\n']
There are various methods available with the file object. Some of them have been used in the above
examples.
Here is a list of commonly used methods in text mode with a brief description:
Method                         Description
close()                        Closes an opened file. It has no effect if the file is already closed.
read(n)                        Reads at most n characters from the file. Reads till end of file if n is negative or None .
readable()                     Returns True if the file stream can be read from.
readline(n=-1)                 Reads and returns one line from the file. Reads in at most n characters if specified.
readlines(n=-1)                Reads and returns a list of lines from the file. Reads in at most n bytes/characters if specified.
seek(offset, whence=SEEK_SET)  Changes the file position to offset bytes, in reference to whence (start, current, end).
write(s)                       Writes the string s to the file and returns the number of characters written.
Python Directory
If there are a large number of files to handle in our Python program, we can arrange our code within
different directories to make things more manageable.
A directory or folder is a collection of files and subdirectories. Python has the os module that
provides us with many useful methods to work with directories (and files as well).
We can get the present working directory using the getcwd() method of the os module. This method
returns the current working directory in the form of a string. We can also use the getcwdb()
method to get it as a bytes object.
>>> import os

>>> os.getcwd()
'C:\\Program Files\\PyScripter'

>>> os.getcwdb()
b'C:\\Program Files\\PyScripter'
The extra backslash implies an escape sequence. The print() function will render this properly.
>>> print(os.getcwd())
C:\Program Files\PyScripter
Changing Directory
We can change the current working directory by using the chdir() method.
The new path that we want to change into must be supplied as a string to this method. We can use
either the forward slash / or the backslash \ to separate the path elements.
>>> os.chdir('C:\\Python33')

>>> print(os.getcwd())
C:\Python33
The listdir() method takes in a path and returns a list of subdirectories and files in that path. If
no path is specified, it returns the list of subdirectories and files from the current working directory.
>>> print(os.getcwd())
C:\Python33

>>> os.listdir()
['DLLs',
'Doc',
'include',
'Lib',
'libs',
'LICENSE.txt',
'NEWS.txt',
'python.exe',
'pythonw.exe',
'README.txt',
'Scripts',
'tcl',
'Tools']

>>> os.listdir('G:\\')
['$RECYCLE.BIN',
'Movies',
'Music',
'Photos',
'Series',
'System Volume Information']
The mkdir() method takes in the path of the new directory. If the full path is not specified, the new
directory is created in the current working directory.
>>> os.mkdir('test')

>>> os.listdir()
['test']
For renaming any directory or file, the rename() method takes in two basic arguments: the old
name as the first argument and the new name as the second argument.
>>> os.listdir()
['test']

>>> os.rename('test','new_one')

>>> os.listdir()
['new_one']
A file can be removed (deleted) using the remove() method. Similarly, the rmdir() method removes
an empty directory.
>>> os.listdir()
['new_one', 'old.txt']

>>> os.remove('old.txt')
>>> os.listdir()
['new_one']

>>> os.rmdir('new_one')
>>> os.listdir()
[]
The rmdir() method can only remove empty directories. To remove a non-empty directory, we can use
the rmtree() method of the shutil module.
>>> os.listdir()
['test']

>>> os.rmdir('test')
Traceback (most recent call last):
...
OSError: [WinError 145] The directory is not empty: 'test'

>>> import shutil

>>> shutil.rmtree('test')
>>> os.listdir()
[]
Python RegEx
In this tutorial, you’ll explore regular expressions, also known as regexes, in Python. A regex is a
special sequence of characters that defines a pattern for complex string-matching functionality.
Earlier in this series, in the tutorial Strings and Character Data in Python, you learned how to define
and manipulate string objects. Since then, you’ve seen some ways to determine whether two strings
match each other:
You can test whether two strings are equal using the equality ( == ) operator.
You can test whether one string is a substring of another with the in operator or the built-in
string methods .find() and .index() .
String matching like this is a common task in programming, and you can get a lot done with string
operators and built-in methods. At times, though, you may need more sophisticated pattern-
matching capabilities.
Fasten your seat belt! Regex syntax takes a little getting used to. But once you get comfortable with
it, you’ll find regexes almost indispensable in your Python programming.
Imagine you have a string object s . Now suppose you need to write Python code to find out
whether s contains the substring '123' . There are at least a couple ways to do this. You could
use the in operator:
>>> s = 'foo123bar'
>>> '123' in s
True
If you want to know not only whether '123' exists in s but also where it exists, then you can use
.find() or .index() . Each of these returns the character position within s where the substring
resides:
>>> s = 'foo123bar'
>>> s.find('123')
3
>>> s.index('123')
3
Sometimes, though, the problem is more complicated than a fixed character-by-character
comparison. For example, rather than searching for a fixed substring like '123' , suppose you
wanted to determine whether a string contains any three consecutive decimal digit characters, as
in the strings 'foo123bar' , 'foo456bar' , '234baz' , and 'qux678' .
Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.
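As a first taste, here is a minimal sketch that checks each of the strings above for three consecutive digits. The extra sample 'foo12bar' is an assumption added to show a non-matching case.

```python
import re

samples = ['foo123bar', 'foo456bar', '234baz', 'qux678', 'foo12bar']

# [0-9] matches any single decimal digit, so three in a row means three consecutive digits
three_digits = r'[0-9][0-9][0-9]'

results = {s: bool(re.search(three_digits, s)) for s in samples}

for s, found in results.items():
    print(s, found)
```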
A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For
example,
^a...s$
The above code defines a RegEx pattern. The pattern is: any five-letter string starting with a and
ending with s.

Expression  String     Matched?
^a...s$     alias      Match
^a...s$     Alias      No match
^a...s$     An abacus  No match
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
Here, we used the re.match() function to search for pattern within test_string . The function
returns a match object if the search is successful. If not, it returns None .
To specify regular expressions, metacharacters are used. In the above example, ^ and $ are
metacharacters.
MetaCharacters
Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list
of metacharacters:
[] . ^ $ * + ? {} () \ |
[] - Square brackets
Square brackets specify a set of characters you wish to match.

Expression  String  Matched?
[abc]       a       1 match
[abc]       ac      2 matches

Here, [abc] will match if the string you are trying to match contains any of a , b or c .
You can also specify a range of characters using - inside square brackets. For example, [a-e] is
the same as [abcde] .
You can complement (invert) the character set by using the caret symbol ^ at the start of a
square bracket. For example, [^abc] means any character except a , b or c .
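A short sketch of character sets, ranges and inverted sets using re.findall() , which returns every match as a list. The sample string is an assumption chosen for illustration.

```python
import re

text = 'abc de ca'

in_set = re.findall(r'[abc]', text)       # any of a, b or c
in_range = re.findall(r'[a-e]', text)     # any character from a to e
not_in_set = re.findall(r'[^abc]', text)  # anything except a, b or c

print(in_set)      # ['a', 'b', 'c', 'c', 'a']
print(in_range)    # ['a', 'b', 'c', 'd', 'e', 'c', 'a']
print(not_in_set)  # [' ', 'd', 'e', ' ']
```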
. - Period
A period matches any single character (except newline '\n' ).

Expression  String  Matched?
..          a       No match
..          ac      1 match
..          acd     1 match
..          acde    2 matches (contains 4 characters)
^ - Caret
The caret symbol ^ is used to check if a string starts with a certain character.

Expression  String  Matched?
^a          a       1 match
^a          abc     1 match
^a          bac     No match

$ - Dollar
The dollar symbol $ is used to check if a string ends with a certain character.

Expression  String   Matched?
a$          a        1 match
a$          formula  1 match
a$          cab      No match
* - Star
The star symbol * matches zero or more occurrences of the pattern to its left.
+ - Plus
The plus symbol + matches one or more occurrences of the pattern to its left.
? - Question Mark
The question mark symbol ? matches zero or one occurrence of the pattern to its left.

Expression  String  Matched?
ma?n        mn      1 match
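The three quantifiers above can be compared side by side. This is a minimal sketch; the word list is an assumption chosen so that each pattern accepts a different subset.

```python
import re

words = ['mn', 'man', 'maaan', 'main']

quantifier_matches = {}
for pattern in (r'ma*n', r'ma+n', r'ma?n'):
    # keep the words in which the pattern is found
    quantifier_matches[pattern] = [w for w in words if re.search(pattern, w)]

for pattern, matched in quantifier_matches.items():
    print(pattern, matched)
```

Here 'main' never matches, because the a is not directly followed by n .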
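{} - Braces
Consider this code: {n,m} . This means at least n, and at most m repetitions of the pattern to its
left.
Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 digits but not more than
4 digits. Note that the braces must not contain a space: Python treats {2, 4} as literal text
rather than as a repetition count.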
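A brief sketch of the braces quantifier in action. The sample strings are assumptions chosen for illustration.

```python
import re

pattern = r'[0-9]{2,4}'  # between 2 and 4 consecutive digits

short_run = re.findall(pattern, 'ab123csde')
long_run = re.findall(pattern, '12 and 345673')  # a run of 6 digits is split into 4 + 2
single_digits = re.findall(pattern, '1 and 2')   # single digits are too short

print(short_run)      # ['123']
print(long_run)       # ['12', '3456', '73']
print(single_digits)  # []
```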
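| - Alternation
The vertical bar | is used for alternation (the or operator). For example, a|b matches any
string that contains either a or b .
() - Group
Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that
contains either a or b or c followed by xz .

Expression  String  Matched?
(a|b|c)xz   ab xz   No match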
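A minimal sketch of alternation and grouping together. The sample strings other than 'ab xz' are assumptions.

```python
import re

either = re.findall(r'a|b', 'acdbea')  # every a or b in the string

no_match = re.search(r'(a|b|c)xz', 'ab xz')  # the space breaks the sequence

# finditer yields match objects; group() is the whole matched text
grouped = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]

print(either)    # ['a', 'b', 'a']
print(no_match)  # None
print(grouped)   # ['axz', 'bxz']
```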
\ - Backslash
Backslash \ is used to escape various characters including all metacharacters. For example,
\$a matches if a string contains $ followed by a . Here, $ is not interpreted by a RegEx engine in
a special way.
If you are unsure whether a punctuation character has a special meaning, you can put \ in front
of it. This makes sure the character is not treated in a special way. A backslash before a letter
or digit is different: it may start a special sequence such as \d .
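A small sketch of escaping, assuming a sample string that contains a literal $ . It also uses re.escape() , a standard re function that adds the backslashes for you.

```python
import re

text = 'cost: $a'

escaped = re.search(r'\$a', text)   # \$ means a literal dollar sign
unescaped = re.search(r'$a', text)  # $ means end of string, so nothing can follow it

# re.escape() builds a pattern that matches the given text literally
auto_escaped = re.search(re.escape('$a'), text)

print(escaped.group())
print(unescaped)
print(auto_escaped.group())
```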
Special Sequences
Special sequences make commonly used patterns easier to write. Here's a list of special
sequences:
\B - Opposite of \b . Matches if the specified characters are not at the beginning or end of a
word.
\S - Matches where a string contains any non-whitespace character. Equivalent to
[^ \t\n\r\f\v] .

Expression  String             Matched?
\d          Python             No match
\D          1345               No match
\s          PythonRegEx        No match
\S          a b                2 matches (at a b )
\S          (whitespace only)  No match
\w          %"> !              No match
\W          Python             No match
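The special sequences can be tried out with re.findall() . This is a sketch only; the sample strings are assumptions, and the one-line descriptions in the comments are the standard meanings of each sequence.

```python
import re

digits = re.findall(r'\d', '12abc3')            # \d: decimal digits
non_digits = re.findall(r'\D', '1ab34"50')      # \D: anything but digits
spaces = re.findall(r'\s', 'Python RegEx')      # \s: whitespace
non_spaces = re.findall(r'\S', 'a b')           # \S: non-whitespace
word_chars = re.findall(r'\w', '12&": ;c')      # \w: letters, digits, underscore
non_word_chars = re.findall(r'\W', '1a2%c')     # \W: anything else
inside_word = re.findall(r'\Bfoo', 'a football afootball')  # foo not at a word start

print(digits, non_digits, spaces, non_spaces)
print(word_chars, non_word_chars, inside_word)
```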
Tip: To build and test regular expressions, you can use RegEx tester tools such as
regex101. This tool not only helps you in creating regular expressions, but it also helps
you learn it.
Now that you understand the basics of RegEx, let's discuss how to use RegEx in your Python code.
Python RegEx
Python has a module named re to work with regular expressions. To use it, we need to import the
module.
import re
The module defines several functions and constants to work with RegEx.
re.findall()
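As a brief sketch, re.findall() returns a list of all non-overlapping matches of a pattern in a string, or an empty list if there are none. The sample strings below are assumptions chosen for illustration.

```python
import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # one or more digits

result = re.findall(pattern, string)
print(result)  # ['12', '89', '34']

# no digits at all: an empty list is returned
empty_result = re.findall(pattern, 'no digits here')
print(empty_result)  # []
```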
re.split()
The re.split method splits the string where there is a match and returns a list of strings where
the splits have occurred.
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'  # raw string, so the backslash reaches the regex engine unchanged

result = re.split(pattern, string)
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']
If the pattern is not found, re.split() returns a list containing the original string.
You can pass the maxsplit argument to the re.split() method. It's the maximum number of splits
that will occur.
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
By the way, the default value of maxsplit is 0; meaning all possible splits.
re.sub()
The re.sub(pattern, replace, string) method returns a string where matched occurrences of
pattern are replaced with the content of the replace variable.
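A minimal sketch of re.sub() that removes every run of whitespace. The sample string is an assumption.

```python
import re

string = 'abc 12\nde 23 \n f45 6'
pattern = r'\s+'  # one or more whitespace characters
replace = ''      # replace each match with nothing

new_string = re.sub(pattern, replace, string)
print(new_string)  # abc12de23f456

# if the pattern is not found, the original string is returned unchanged
unchanged = re.sub(r'\d+', replace, 'no digits')
print(unchanged)
```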
You can pass count as a fourth parameter to the re.sub() method. If omitted, it defaults to 0,
which replaces all occurrences.
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

# Output:
# abc12de 23
#  f45 6
re.subn()
The re.subn() method is similar to re.sub() except it returns a tuple of 2 items containing the
new string and the number of substitutions made.
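A short sketch of re.subn() on the same kind of input. The sample string is an assumption.

```python
import re

string = 'abc 12\nde 23 \n f45 6'
pattern = r'\s+'
replace = ''

# returns (new_string, number_of_substitutions)
new_string, substitutions = re.subn(pattern, replace, string)

print(new_string)     # abc12de23f456
print(substitutions)  # 5 runs of whitespace were replaced
```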
re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the
first location where the RegEx pattern produces a match with the string.
If the search is successful, re.search() returns a match object; if not, it returns None .
import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

# Output: pattern found inside the string
You can get the methods and attributes of a match object using the dir() function.
Some of the commonly used methods and attributes of match objects are:
match.group()
The group() method returns the part of the string where there is a match.
import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("pattern not found")

# Output: 801 35
Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}) . You can get the part
of the string of these parenthesized subgroups. Here's how:
>>> match.group(1)
'801'

>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')

>>> match.groups()
('801', '35')
The start() method returns the index of the start of the matched substring. Similarly, end()
returns the end index of the matched substring.
>>> match.start()
2
>>> match.end()
8
The span() method returns a tuple containing the start and end index of the matched part.
>>> match.span()
(2, 8)
The re attribute of a match object returns the regular expression object. Similarly, the string
attribute returns the passed string.
>>> match.re
re.compile('(\\d{3}) (\\d{2})')

>>> match.string
'39801 356, 2102 1111'
We have covered all commonly used methods defined in the re module. If you want to
learn more, visit Python 3 re module.