Chapter 7
Playing with Text and Data Files
Working with text and data files is an essential part of using Linux/Unix
systems.
These systems store most of their configuration, logs, and data in text files.
Knowing how to view, analyze, edit, and manipulate such files quickly
can save time and improve efficiency.
In this chapter, we will learn:
• How to view and analyze text files in different ways.
• Commands to sort, search, split, and compare file contents.
• Basic and advanced editing tools like Pico and Vim.
By the end of this chapter, you will be able to:
• Open and read files in various formats.
• Filter and organize data.
• Create and edit files efficiently.
7.1.1 A Quick Start: cat
Definition:
cat (short for concatenate) is a Linux command used to view the contents
of files, combine multiple files, and even create new files.
Common Uses:
1. View the content of a file
cat filename.txt
Displays the full content of filename.txt on the terminal.
2. Combine and display multiple files
cat file1.txt file2.txt
Shows the contents of file1.txt followed by file2.txt.
3. Create a new file
cat > newfile.txt
Type your content.
Press CTRL+D on an empty line to save and exit.
If newfile.txt already exists, its previous contents are overwritten.
4. Append text to an existing file
cat >> existing.txt
Type new content.
Press CTRL+D to save without overwriting.
Example:
cat fruits.txt
Output:
Apple
Banana
Mango
Orange
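The interactive uses above can also be scripted. The following is a minimal sketch, assuming a scratch directory and a second, hypothetical file named vegetables.txt; a here-document stands in for the text you would otherwise type before pressing CTRL+D.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a new file (same effect as typing into `cat > fruits.txt`).
cat > fruits.txt <<'EOF'
Apple
Banana
EOF

# Append to the existing file (same effect as `cat >> fruits.txt`).
cat >> fruits.txt <<'EOF'
Mango
Orange
EOF

# A second, hypothetical file to demonstrate concatenation.
printf 'Carrot\nPotato\n' > vegetables.txt

# Combine both files into a third one, then display it.
cat fruits.txt vegetables.txt > produce.txt
cat produce.txt
```

Redirecting with > replaces the target file, while >> adds to the end of it, so the order of the two here-documents matters.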
7.1.2 Text Sorting
Definition:
sort is a Linux command used to arrange the lines of a text file in alphabetical or
numerical order. It can also reverse the order or sort based on specific fields.
Common Uses:
1. Sort alphabetically
sort filename.txt
Arranges lines in ascending (A–Z) order.
2. Sort in reverse order
sort -r filename.txt
Arranges lines in descending (Z–A) order.
3. Sort numerically
sort -n numbers.txt
Sorts lines as numbers instead of text.
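4. Sort by a specific column (useful for tables)
sort -k2 data.txt
Sorts using a key that starts at the second whitespace-separated column and runs to the end of the line.
To compare the second column only, use sort -k2,2 data.txt.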
Example:
cat fruits.txt
Mango
Apple
Orange
Banana
Command:
sort fruits.txt
Output:
Apple
Banana
Mango
Orange
Example 2 – Sorting and Removing Duplicates
File: names.txt
Rahul
Anita
Rahul
Suman
Anita
Command:
sort -u names.txt
Output:
Anita
Rahul
Suman
You can combine options, e.g., sort -nr for numeric sorting in reverse order.
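The column and numeric options can be combined in a single key. Here is a small sketch, assuming a hypothetical two-column file scores.txt (a name and a score separated by a space); the file and its values are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: a name and a score, separated by a space.
printf 'Rahul 72\nAnita 9\nSuman 100\nAnita 9\n' > scores.txt

# -k2,2 restricts the sort key to the second column only;
# the n suffix compares that column as numbers, so 9 < 72 < 100.
by_score=$(sort -k2,2n scores.txt)

# Adding r reverses the order: highest score first.
top_first=$(sort -k2,2nr scores.txt)

# -u drops repeated lines while sorting.
unique_sorted=$(sort -u scores.txt)

printf '%s\n\n%s\n\n%s\n' "$by_score" "$top_first" "$unique_sorted"
```

Without the n suffix the scores would be compared as text, and "100" would sort before "72" because the character 1 comes before 7.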
7.1.3 Extract Unique Lines
Definition:
The uniq command in Linux is used to filter out repeated lines from a file.
However, it only removes consecutive duplicates, so files should be sorted first for
best results.
Common Uses:
1. Remove consecutive duplicates
uniq filename.txt
Displays the file content with consecutive duplicates removed.
2. Count occurrences of each line
uniq -c filename.txt
Shows how many times each line appears.
3. Show only duplicate lines
uniq -d filename.txt
Displays only the lines that appear more than once.
4. Show only unique lines
uniq -u filename.txt
Displays lines that appear exactly once.
Example – Removing Duplicates
File: names.txt
Anita
Anita
Rahul
Rahul
Suman
Command:
uniq names.txt
Output:
Anita
Rahul
Suman
Example – Counting Occurrences
uniq -c names.txt
Output:
2 Anita
2 Rahul
1 Suman
For accurate results with all duplicates removed, combine sort with uniq:
sort names.txt | uniq
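To see why sorting matters, the sketch below runs uniq on the unsorted names.txt from the sort section, where the duplicates are not adjacent, and then again after sorting. The variable names are illustrative only.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Unsorted input: the duplicates are NOT on consecutive lines.
printf 'Rahul\nAnita\nRahul\nSuman\nAnita\n' > names.txt

# uniq alone removes nothing here, because no two adjacent lines match.
unsorted_count=$(uniq names.txt | wc -l | tr -d ' ')

# Sorting first makes duplicates adjacent, so uniq can collapse them.
sorted_count=$(sort names.txt | uniq | wc -l | tr -d ' ')

# Names that occur more than once, and names that occur exactly once.
repeated=$(sort names.txt | uniq -d)
single=$(sort names.txt | uniq -u)

echo "lines after uniq alone:  $unsorted_count"
echo "lines after sort | uniq: $sorted_count"
printf 'repeated:\n%s\n' "$repeated"
printf 'single:\n%s\n' "$single"

# A frequency table, most common first.
sort names.txt | uniq -c | sort -nr
```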
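Search Commands in Vim
These commands are typed in Vim's Normal mode (press Esc first).
• /regex → Searches forward for regex.
Example: /apple → moves the cursor to the next "apple" after the cursor.
• ?regex → Searches backward for regex.
Example: ?apple → searches upward for "apple".
• n → Jumps to the next match in the same direction.
Example: after /apple, press n to go to the next "apple".
• N (Shift + n) → Jumps to the next match in the opposite direction (the previous match after a / search).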
Substitute Commands
1. Current line only
• :s/regex/xyz/
→ Replaces the first occurrence of regex in the current line.
Example: :s/apple/orange/
  Line: apple is red → becomes → orange is red.
• :s/regex/xyz/g
→ Replaces all occurrences of regex in the current line.
Example: :s/apple/orange/g
  Line: apple apple pie → becomes → orange orange pie.
• :s/regex/xyz/c
→ Asks for confirmation before replacing the first occurrence in the line. Add g (:s/regex/xyz/gc) to be asked about every occurrence.
Example: :s/apple/orange/c
  Vim prompts: replace with orange (y/n/a/q/l/^E/^Y)?
2. Whole file
• :%s/regex/xyz/g
→ Replaces all occurrences of regex in the whole file.
Example: :%s/apple/orange/g
  File becomes:
    orange is red
    orange is sweet
    banana is yellow
    orange pie is tasty
• :%s/regex/xyz/gc
→ Same as above, but asks for confirmation before each replacement.
3. Between specific lines
• :x,ys/regex/xyz/g
→ Replaces all occurrences of regex from line x to line y, inclusive.
Example: :2,3s/apple/orange/g
  Only lines 2 and 3 are changed, so the file becomes:
    apple is red
    orange is sweet
    banana is yellow
    apple pie is tasty
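Vim's :s command shares its s/regex/replacement/flags syntax with the stream editor sed, so the whole-file and line-range substitutions can be tried from the shell without opening an editor. The sketch below uses sed rather than Vim itself (a substitution made for convenience, not a tool covered in this chapter) on a hypothetical four-line file, fruit.txt, matching the examples above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical four-line file matching the Vim examples.
printf 'apple is red\napple is sweet\nbanana is yellow\napple pie is tasty\n' > fruit.txt

# Counterpart of :%s/apple/orange/g (every line, every match).
# sed applies a command to all lines by default, so no % is needed.
whole_file=$(sed 's/apple/orange/g' fruit.txt)

# Counterpart of :2,3s/apple/orange/g (only lines 2 and 3).
lines_2_to_3=$(sed '2,3s/apple/orange/g' fruit.txt)

printf '%s\n\n%s\n' "$whole_file" "$lines_2_to_3"
```

Unlike Vim, sed writes the result to standard output and leaves fruit.txt unchanged, and it has no equivalent of the c confirmation flag.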
Finding Matching Lines of Text using grep, egrep
1. Using grep
grep prints the lines of a file that match a pattern, written as a basic regular expression.
Example 1: Find lines containing "human"
grep "human" genomes.txt
Output:
H. sapiens (human) - 3,400,000,000 bp - 30.000 genes
Example 2: Find lines containing "human" and show line numbers
grep -n "human" genomes.txt
Output:
1:H. sapiens (human) - 3,400,000,000 bp - 30.000 genes
2. Using egrep (or grep -E)
egrep allows extended regex patterns (like |). Recent versions of GNU grep mark egrep as obsolescent, so grep -E is the preferred spelling.
Example 1: Match multiple patterns (|)
egrep 'bacteria|human' genomes.txt
Output:
H. sapiens (human) - 3,400,000,000 bp - 30.000 genes
E. coli (bacteria) - 4,670,000 bp - 3237 genes
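The grep options above combine with a few others that are often useful. This is a minimal sketch, assuming a sample genomes.txt that contains only the two lines quoted above (the real file is longer); the -c, -i, and -v options are standard grep flags that the text does not otherwise cover.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample file built from the two lines quoted above.
cat > genomes.txt <<'EOF'
H. sapiens (human) - 3,400,000,000 bp - 30.000 genes
E. coli (bacteria) - 4,670,000 bp - 3237 genes
EOF

# -n prefixes each match with its line number and a colon.
numbered=$(grep -n "human" genomes.txt)

# -E enables extended regex, so | means "either pattern";
# -c prints the number of matching lines instead of the lines.
either_count=$(grep -E -c 'bacteria|human' genomes.txt)

# -i ignores case; -v inverts the match (lines NOT containing the pattern).
not_human=$(grep -i -v "HUMAN" genomes.txt)

echo "$numbered"
echo "$either_count"
echo "$not_human"
```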
Text File Comparisons using diff command
Suppose we have two files:
file1.txt
apple
banana
grapes
mango
file2.txt
apple
banana
orange
mango
1. Basic Comparison using diff
diff file1.txt file2.txt
Output:
3c3
< grapes
---
> orange
Meaning:
• Line 3 changed (c)
• In file1.txt it was grapes
• In file2.txt it is orange
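The comparison can be reproduced in a scratch directory. The sketch below also captures diff's exit status (0 when the files are identical, 1 when they differ) and prints the unified format with -u; both are standard diff behavior but go beyond what the text above describes.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ngrapes\nmango\n' > file1.txt
printf 'apple\nbanana\norange\nmango\n' > file2.txt

# diff exits with 0 when the files match and 1 when they differ,
# so capture both the report and the exit status.
diff_status=0
report=$(diff file1.txt file2.txt) || diff_status=$?

echo "$report"
echo "exit status: $diff_status"

# -u prints a unified diff with context lines around each change.
# "|| true" keeps a script running even though diff reports a difference.
diff -u file1.txt file2.txt || true
```

The exit status makes diff usable in scripts, for example to run a step only when two files differ.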
2. less command
less displays a file one screen at a time.
• It is faster and safer than opening large files in editors like nano or vi.
• It is a read-only viewer, so it does not modify files.
• It allows easy scrolling, searching, and navigation.
Ex: less filename
Key / Command    Action
/word            Search forward for a word
?word            Search backward for a word
n                Repeat the last search in the same direction
N                Repeat the last search in the opposite direction
g                Go to the beginning of the file
G                Go to the end of the file
q                Quit less
• Open multiple files:
less file1.txt file2.txt
Then use:
  :n → Next file
  :p → Previous file
• Search for a word:
Inside less, type / followed by the word and press Enter.
Ex: /word
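3. Counting Characters, Words, and Lines
The wc command counts the lines, words, and characters (bytes) in a file.
Ex: wc genomes.txt
The output reports 6 lines, 42 words, and 246 characters. wc prints the three counts in that order, followed by the file name.
The character count includes invisible characters such as line breaks.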
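Each count can also be requested on its own with -l, -w, or -c. The sketch below uses a hypothetical three-line file, sample.txt, chosen so the effect of the line breaks on the count is easy to check by hand: 25 visible characters plus 3 newlines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-line sample file.
printf 'one two\nthree\nfour five six\n' > sample.txt

# Reading from standard input (<) keeps the file name out of the output.
# tr strips the padding that some wc implementations add.
n_lines=$(wc -l < sample.txt | tr -d ' ')
n_words=$(wc -w < sample.txt | tr -d ' ')
n_bytes=$(wc -c < sample.txt | tr -d ' ')

# prints: 3 lines, 6 words, 28 bytes
echo "$n_lines lines, $n_words words, $n_bytes bytes"
```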
4. Splitting Files into Pieces
The split command splits a file into a series of smaller files.
split [options] filename [prefix]
filename → the file you want to split
prefix → (optional) name prefix for output files (default: x)
Ex: split -l 2 genomes.txt genomes.
Explanation
split          The command to split files
-l 2           Split into chunks of 2 lines per file
genomes.txt    The input file to be split
genomes.       The prefix for output files
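With the default suffixes, the pieces are named by appending aa, ab, ac, and so on to the prefix. The following sketch uses a hypothetical six-line stand-in for genomes.txt to show the resulting file names and to check that the pieces join back into the original.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical six-line stand-in for genomes.txt.
printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > genomes.txt

# Two lines per piece; output names are the prefix plus aa, ab, ac, ...
split -l 2 genomes.txt genomes.

# Lists genomes.aa, genomes.ab, and genomes.ac.
ls genomes.a*

# The glob expands in alphabetical order, so cat restores the original.
cat genomes.a* > rejoined.txt
cmp genomes.txt rejoined.txt && echo "pieces rejoin to the original"
```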