Commit 81154c5 · Add files via upload · oxylabs/python-cache-tutorial · GitHub

6 files changed: +360, −0 lines

images/comparison-chart.png (427 KB)
images/output_normal_lru.png (9.43 KB)
images/output_normal_memoized.png (10.4 KB)

readme.md
# **Python Cache: How to Speed Up Your Code with Effective Caching**

This article will show you how to use caching in Python with your web scraping tasks. You can read the [<u>full article</u>](https://oxylabs.io/blog/python-cache-how-to-use-effectively) on our blog, where we delve deeper into the different caching strategies.
## **How to implement a cache in Python**

There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our [<u>step-by-step Python web scraping guide</u>](https://oxylabs.io/blog/python-web-scraping).
### **Install the required libraries**

We’ll use the [<u>requests library</u>](https://pypi.org/project/requests/) to make HTTP requests to a website. Install it with [<u>pip</u>](https://pypi.org/project/pip/) by entering the following command in your terminal:

```bash
python -m pip install requests
```

The other modules we’ll use in this project, time and functools, are part of Python’s standard library (we used Python 3.11.2), so you don’t have to install them.
### **Method 1: Python caching using a manual decorator**

A [<u>decorator</u>](https://peps.python.org/pep-0318/) in Python is a function that accepts another function as an argument and returns a new function. Using a decorator, we can alter the behavior of the original function without changing its source code.

One common use case for decorators is to implement caching: create a dictionary to store the function’s results and serve them from the cache on future calls.

Let’s start by creating a simple function that takes a URL as an argument, requests that URL, and returns the response text:

```python
import requests


def get_html_data(url):
    response = requests.get(url)
    return response.text
```
Now, let's create a memoized version of this function:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```
Here, we define a memoize decorator that creates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current input arguments have been cached before and, if so, returns the previously cached result. If not, it calls the original function and caches the result before returning it.

By adding @memoize above the function definition, we apply the memoize decorator to the get_html_data function. This produces a new memoized function that we’ve called get_html_data_cached. It makes only a single network request for a URL and then serves the response from the cache for further requests.
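One caveat of the manual decorator: the wrapper replaces the original function, so attributes like its name and docstring are lost. Wrapping the inner function with functools.wraps preserves them. A minimal sketch — the double function here is an illustrative stand-in, not part of the tutorial's code:

```python
from functools import wraps


def memoize(func):
    cache = {}

    @wraps(func)  # copy __name__, __doc__, etc. from func onto wrapper
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


@memoize
def double(x):
    """Return x doubled."""
    return x * 2


print(double.__name__)  # without @wraps, this would print 'wrapper'
```

The same @wraps(func) line can be added to the memoize decorator above without changing its caching behavior.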
Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:

```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s what the complete code looks like:

```python
# Import the required modules
import time

import requests


# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text


# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
Here’s the output:

![Output of the normal and memoized functions](images/output_normal_memoized.png)

Notice the time difference between the two functions. Both take almost the same time here; the advantage of caching only shows up on re-access.

Since we’re making only one request, the memoized function still has to fetch the data over the network the first time. Therefore, with our example, a significant time difference in execution isn’t expected. However, if you increase the number of calls to these functions, the time difference will significantly increase (see [<u>Performance Comparison</u>](#performance-comparison)).
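Timing a single request can be noisy, so another way to see the cache at work is to count how often the underlying function actually runs. A small offline sketch using the same memoize decorator — fake_fetch and call_count are illustrative stand-ins for the real network call:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        result = func(*args)
        cache[args] = result
        return result

    return wrapper


call_count = 0


@memoize
def fake_fetch(url):
    # Stand-in for a network request so the example runs offline
    global call_count
    call_count += 1
    return f'<html>{url}</html>'


fake_fetch('https://books.toscrape.com/')
fake_fetch('https://books.toscrape.com/')  # served from the cache
print(call_count)  # the underlying function ran only once
```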
### **Method 2: Python caching using LRU cache decorator**

Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. This decorator implements a cache using the least recently used (LRU) strategy: the cache has a fixed size, and once it fills up, it discards the data that hasn’t been used recently.

To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator name on top of it. Make sure to import lru_cache from functools before using the decorator:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```

In the above example, the get_html_data_lru function is memoized using the @lru_cache decorator. When the maxsize option is set to None, the cache can grow indefinitely. Here’s the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time

import requests


# Function for getting HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text


# Getting time for normal function to extract HTML content
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Getting time for memoized function (LRU cache) to extract HTML content
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```
This produced the following output:

![Output of the normal and LRU-cached functions](images/output_normal_lru.png)
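With maxsize=None the cache grows without bound, which may be undesirable in a long-running scraper. Setting a bound makes @lru_cache evict the least recently used entry, and cache_info() reports hits and misses. A minimal offline sketch — the square function is illustrative, not part of the tutorial:

```python
from functools import lru_cache


@lru_cache(maxsize=2)  # keep at most two results
def square(n):
    return n * n


square(1)   # miss: computed and cached
square(2)   # miss: cache now holds 1 and 2
square(1)   # hit: served from the cache
square(3)   # miss: cache full, evicts 2 (least recently used)
square(2)   # miss: 2 was evicted, so it's recomputed
print(square.cache_info())  # CacheInfo(hits=1, misses=4, maxsize=2, currsize=2)
```

Calling square.cache_clear() resets the cache, which is handy when the underlying data can change.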
### **Performance comparison**

In the following table, we’ve measured the execution times of all three functions for different numbers of requests:

| **No. of requests** | **Time taken by normal function** | **Time taken by memoized function (manual decorator)** | **Time taken by memoized function (lru_cache decorator)** |
|---------------------|-----------------------------------|--------------------------------------------------------|-----------------------------------------------------------|
| 1                   | 2.1 seconds                       | 2.0 seconds                                            | 1.7 seconds                                               |
| 10                  | 17.3 seconds                      | 2.1 seconds                                            | 1.8 seconds                                               |
| 20                  | 32.2 seconds                      | 2.2 seconds                                            | 2.1 seconds                                               |
| 30                  | 57.3 seconds                      | 2.22 seconds                                           | 2.12 seconds                                              |

As the number of requests to the functions increases, you can see a significant reduction in execution times using either caching strategy. The following comparison chart depicts these results:

![Comparison chart of execution times](images/comparison-chart.png)

The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.
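Your absolute numbers will vary with network conditions. To reproduce the shape of this comparison offline, you can simulate a slow fetch with time.sleep — a hedged sketch, not the benchmark behind the table above; slow_fetch and the 0.2-second delay are illustrative:

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_fetch(url):
    time.sleep(0.2)  # simulate network latency
    return f'<html>{url}</html>'


start_time = time.time()
for _ in range(10):
    slow_fetch('https://example.com/')  # only the first call pays the delay
elapsed = time.time() - start_time
print(f'{elapsed:.2f} seconds for 10 calls')
```

Without the cache, ten calls would take roughly ten times the per-call delay; with it, only the first call is slow.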
Feel free to visit our [<u>blog</u>](https://oxylabs.io/blog) for an array of intriguing web scraping topics that will keep you hooked!

src/lru_caching.py

```python
# Import the required modules
from functools import lru_cache
import time

import requests


# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text


# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```

src/manual_decorator_caching.py

```python
# Import the required modules
import time

import requests


# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text


# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
