Software Engineering For Data Scientists Chap4

O'REILLY

Chapter 4. Code Formatting, Linting, and Type Annotations

With Early Release ebooks, you get books in their earliest form (the author's raw and unedited content as they write) so you can take advantage of these technologies long before the official release of these titles.

This will be the 7th chapter of the final book. Please note that the GitHub repo will be made active later on.

If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the author at catherine nelsont@gmail.com.

At first, this may seem like an odd choice for a chapter in this book. You might be wondering why the formatting of your code gets so much attention. Why does it matter what your code looks like? Why do people spend their precious time setting standards for the number of spaces around a + sign? It's because consistent, standardized formatting makes your code much easier to read. And, as I discussed in Chapter 1, if your code is more readable it's much more likely to get reused.

I'm also including in this chapter some tools that will allow you to find mistakes in your code before you run it. Unlike code in many compiled languages, Python code gets very little automatic checking before you run it: apart from a syntax check when the file is loaded, mistakes only show up when the faulty line executes. So, if you have a long script with a mistake such as a misspelled variable name in the final line, you can run the whole thing but the program will only crash when it gets to this line. And if your script takes a long time to run, this is frustrating. Linters and associated tools can help you find some of these mistakes before you run your code.

My key message in this chapter is that these tools should be automated as far as possible. The details of your code formatting are very boring. This is not where you want to spend your time. I'll show you how to make good use of your IDE and set things up so that the standards you need to conform with are automatically applied.
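To make the point about late failures concrete, here is a minimal sketch. The function names (slow_step, run_script) and the misspelled variable totl are hypothetical, invented only for this illustration. Python accepts the function definition without complaint, and the mistake only surfaces as a NameError once the faulty line actually runs.

```python
def slow_step() -> int:
    """Stand-in for a long-running computation (hypothetical)."""
    return sum(range(1_000))


def run_script() -> int:
    total = slow_step()
    # Typo: `totl` was never defined. Python only notices this when the
    # line executes, after slow_step() has already finished.
    return totl + 1


try:
    run_script()
    failed_at_runtime = False
except NameError as err:
    failed_at_runtime = True
    message = str(err)
    print(f"Failed only at runtime: {message}")
```

A linter reads the source without running it, so it can report this kind of undefined name before you spend time on the slow step.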
Most of the tools I describe in this chapter don't work in a Jupyter Notebook. They're not appropriate for code that's divided into separate cells; they're designed for use on longer scripts. So in this chapter, I'll give examples of how to run the tools I describe on standalone scripts.

Code Formatting and Style Guides

Formatting your code according to a style guide is an important part of writing good code. A style guide can be set by a company, for example Google's style guide, or if your company doesn't have its own style guide, the default is to use PEP8, described below.

Code formatting alters the appearance of your code, but it does not change anything about how the code works. Formatting includes things like the location of line breaks in your code, where whitespace goes around an equals sign, or the number of blank lines in a script between different functions.

Applying a consistent style makes your code more readable. It's faster to read new code if it's in a consistent style, because it's easier to read code if you know what to expect. This consistency also makes it less likely that you will inadvertently introduce syntax errors, for example with missing or mismatched parentheses. Again, this is because it's easier to know what to expect with standardized code.

Do you use tabs or spaces to indent your code? It's become a cliche that this is something that developers argue about, and there are lengthy internet debates about which of these is better. Spaces show up consistently on any computer, while tabs don't, but tabs are faster to type and reduce the file size, because there are fewer characters to save. In 2016, Felipe Hoffa, then a developer advocate at Google, analyzed 1 billion files of code (14 terabytes!) to discover whether tabs or spaces were more popular. Spaces were overwhelmingly the most popular.
This also gives you an insight into the mindset of some software engineers: they can be very focused on the fine details of the code they are writing.

In this section, I'll describe the main features of PEP8, how to format code imports, and how you can automate the process of formatting your code.

PEP8

Python Enhancement Proposal 8, or PEP8, is the document that sets the standards for Python formatting. It is a style guide written by Guido van Rossum, Barry Warsaw and Nick Coghlan in 2001 for the Python standard library, as Python first started to become popular. It has been adopted as a default style guide by the Python developer community to increase consistency across everyone writing code in Python. As PEP8 states:

    A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is the most important.
    (from PEP8)

PEP8 is full of guidelines for what to do and not do in your code. Here's an example of one of these guidelines, saying that you should have a line break after an if statement:

    # Correct:
    if foo == 'blah':
        do_blah_thing()

    # Wrong:
    if foo == 'blah': do_blah_thing()

The code still runs if you use the "wrong" option, but it's much easier to read if you use the "correct" option.

PEP8 has a lot to say about whitespace, because this is one of the things that really helps keep your code readable. For example, it describes best practices for spacing around = signs, and the number of blank lines around a function definition, whether it's inside a class (one) or on its own (two). It also has suggestions for how to write comments and choose variable names, which I will cover in Chapter 5. It suggests you use spaces for indentation, not tabs. There's a lot more detail in PEP8, but you don't need to go through it and read the whole thing.
You can use one of the tools in this chapter to make sure your code conforms with the style guide. Black, described below, makes formatting changes for you. Other tools such as flake8 or pylint can highlight your code in your IDE so that you know where you need to make changes. I'll describe these separately in "Tools for Linting and Formatting".

Import Formatting

Importing external modules frequently causes bugs. It's really easy to forget to update the modules you import when you update your code, so it's good to have a clear list of the modules you are using.

PEP8 sets standards for how to group your imports:

    Imports should be grouped in the following order:
    1. Standard library imports
    2. Related third party imports
    3. Local application/library specific imports
    (from PEP8)

Fortunately, this is not something you need to do manually. You can use a tool such as isort to sort your module imports into the correct order. You can install isort with the following command:

    $ pip install isort

Before running isort, your imports might look something like this:

    import time
    from sklearn.metrics import mean_absolute_error
    import sys, os
    import numpy as np
    from sklearn.model_selection import train_test_split
    import pandas as pd
    from sklearn.neural_network import MLPRegressor
    import matplotlib.pyplot as plt
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

You can run isort with the following command:

    $ isort my_script.py

Afterwards, your imports will look like this:

    import os
    import sys
    import time

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                       StandardScaler)

This is much easier to read and conforms to PEP8.
You can also use isort as a plugin in your IDE.

Automatic Code Formatting with Black

Black is a tool for automating the code formatting process. It enforces a code style so there's no need for human review. The idea behind it is that you can write ugly code quickly, then you can save your file and it is magically made beautiful. Black applies a uniform coding style specified by the tool's authors, which is the reason for the name of this tool: it's taken from Henry Ford's famous quote about being able to choose any color car, as long as it's black.

Black uses a subset of PEP8 with some minor differences. For example, instead of limiting the length of all your lines of code to 79 characters, Black will try to find a sensible place to split longer lines so that they fit within its default limit of 88 characters.

You can install Black with:

    $ pip install black

I'll use the following script to show how Black works in practice, and I'll also use it to demonstrate the linting tools in "Tools for Linting and Formatting". This is the code that produces Figure 2-1, the plot of common Big O classes from Chapter 2, but it has one syntax error, a missing import, and several places where the formatting does not conform with PEP8. This is what the code looks like before running Black:

Example 4-1. plot_big_o.py

    import matplotlib.pyplot as plt

    def plot_big_o(save_path)

        n = np.linspace(1, 10, 1000)
        line_names = ['Constant','Linear','Quadratic','Exponential','Logarithmic']
        big_o = [np.ones(n.shape), n, n**2, 2**n, np.log(n)]

        fig, ax = plt.subplots ()
        fig.set_facecolor("white")

        ax.set_ylim(0,50)
        for i in range(len(big_o)):
            ax.plot(n, big_o[ i], label= line_names[i ])
        ax.set_ylabel('Relative Runtime')
        ax.set_xlabel('Input Size')
        ax.legend()

        fig.savefig(save_path, bbox_inches='tight')

Then, run Black to format the script with the following command:

    $ black plot_big_o.py

Black throws an error at this, because it cannot reformat files with syntax errors.
In this case, the function definition is missing its finishing colon, and should read def plot_big_o(save_path): After fixing this error and rerunning Black, the script looks like this:

    import matplotlib.pyplot as plt


    def plot_big_o(save_path):
        n = np.linspace(1, 10, 1000)
        line_names = ["Constant", "Linear", "Quadratic", "Exponential", "Logarithmic"]
        big_o = [np.ones(n.shape), n, n**2, 2**n, np.log(n)]

        fig, ax = plt.subplots()
        fig.set_facecolor("white")

        ax.set_ylim(0, 50)
        for i in range(len(big_o)):
            ax.plot(n, big_o[i], label=line_names[i])
        ax.set_ylabel("Relative Runtime")
        ax.set_xlabel("Input Size")
        ax.legend()

        fig.savefig(save_path, bbox_inches="tight")

Black has corrected the formatting throughout the script. For example, the line ax.plot(n, big_o[ i], label= line_names[i ]) has been changed to ax.plot(n, big_o[i], label=line_names[i]), which conforms to the PEP8 style guide.

You can also preview the changes that Black will make with the following command:

    $ black plot_big_o.py --diff

If there are any lines of code that you don't want Black to change, you can add the comment # fmt: skip at the end of the line. You can also skip a block of code by putting the comment # fmt: off at the start of the block, and the comment # fmt: on at the end of the block.

Black cleaned the formatting in Example 4-1 but it didn't do anything about the missing import. For that, you'll need to use a linter, which I'll describe next.

Linting

Linting means checking your code for errors before it runs, and another name for it is static analysis. The strange name comes from the lint (fuzz) trap in a clothes dryer, and the first tool to do this was called Lint because it removed fuzz from code. The original lint tool was developed for the C language in 1978, but now linters are common for all programming languages.

Python linters will analyze your code and warn you of some of the things that would cause your code to fail when you run it.
Examples include forgetting one of a pair of brackets, or forgetting to indent the line after a function definition. I'll describe some common linters for Python below.

Tools for Linting and Formatting

There are many tools available that combine both linting and formatting suggestions. Common tools include flake8, pylint, and ruff. These tools will spot potential errors and also check for compliance with a style guide, usually either PEP8 or a subset of it. In this section, I'll show you how to use these tools on the Example 4-1 script from before.

You can install pylint with the following command:

    $ pip install pylint

Then running pylint on Example 4-1 from the command line gives the following result:

    $ pylint plot_big_o.py
    ************* Module plot_big_o
    plot_big_o.py:3:26: E0001: Parsing failed: 'invalid syntax (<unknown>, line 3)' (syntax-error)

Because there is a syntax error, pylint does not complete its analysis of the entire script. It doesn't say exactly what the error is, but the 3:26 means that the error is in line 3, column 26.
In this case, the function definition is missing the trailing colon. Fixing this error and rerunning pylint gives the following output:

    $ pylint plot_big_o.py
    ************* Module plot_big_o
    plot_big_o.py:19:0: C0304: Final newline missing (missing-final-newline)
    plot_big_o.py:1:0: C0114: Missing module docstring (missing-module-docstring)
    plot_big_o.py:3:0: C0116: Missing function or method docstring (missing-function-docstring)
    plot_big_o.py:5:4: C0103: Variable name "n" doesn't conform to snake_case naming style (invalid-nam
    plot_big_o.py:5:8: E0602: Undefined variable 'np' (undefined-variable)
    plot_big_o.py:7:13: E0602: Undefined variable 'np' (undefined-variable)
    plot_big_o.py:7:46: E0602: Undefined variable 'np' (undefined-variable)
    plot_big_o.py:9:9: C0103: Variable name "ax" doesn't conform to snake_case naming style (invalid-na
    plot_big_o.py:13:4: C0200: Consider using enumerate instead of iterating with range and len (consid

This time, pylint is able to scan the rest of the script. It reveals a number of errors (messages starting with E, such as E0602) and places where the code does not follow conventions (messages starting with C, such as C0304). You can then use these messages to update the code and fix the errors. If it's not clear what the correct version of the code should be, the pylint documentation provides help.

flake8 operates in a very similar way to pylint. You can install it using the following command:

    $ pip install flake8

Running flake8 on Example 4-1 gives the following output:

    $ flake8 plot_big_o.py
    plot_big_o.py:3:25: E999 SyntaxError: invalid syntax

This is the same behavior as pylint: it does not complete the analysis of the script, but stops and flags the syntax error.
However, fixing the syntax error and rerunning flake8 gives a different output:

    $ flake8 plot_big_o.py
    plot_big_o.py:3:1: E302 expected 2 blank lines, found 1
    plot_big_o.py:5:9: F821 undefined name 'np'
    plot_big_o.py:6:29: E231 missing whitespace after ','
    plot_big_o.py:6:38: E231 missing whitespace after ','
    plot_big_o.py:6:50: E231 missing whitespace after ','
    plot_big_o.py:6:66: E231 missing whitespace after ','
    plot_big_o.py:7:14: F821 undefined name 'np'
    plot_big_o.py:7:47: F821 undefined name 'np'
    plot_big_o.py:9:27: E211 whitespace before '('
    plot_big_o.py:12:18: E231 missing whitespace after ','
    plot_big_o.py:14:26: E201 whitespace after '['
    plot_big_o.py:14:37: E251 unexpected spaces around keyword / parameter equals
    plot_big_o.py:14:50: E202 whitespace before ']'
    plot_big_o.py:19:48: W292 no newline at end of file

flake8 flags different formatting issues than pylint because it's using a different style guide. It's important to choose one linter and stick to it to ensure your code remains consistent. You should agree within your team which linter you will all use.

Both pylint and flake8 can be used in this way to check a script before you deploy it to production, but some IDEs also check your code while you are writing it. Here's an example from VSCode using the Pylance extension:

[Figure: Linting while coding in VSCode. Screenshot of plot_big_o.py in the editor, with the Problems panel listing Pylance warnings on lines 5 and 7.]

This tool underlines spots with errors that will occur at runtime, and also provides a list of more details. It can also be configured to use pylint, flake8 or many other linters.

Whichever tool you choose, linting your code will save you a lot of time by identifying many errors before they happen and making your code consistent.
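The line and column numbers that the linters report for a syntax error can be reproduced with the standard library alone. The sketch below is not how pylint or flake8 work internally; it only illustrates the general idea of parsing source code without running it. The helper name find_syntax_error is hypothetical, and the exact column and message depend on your Python version.

```python
import ast


def find_syntax_error(source):
    """Return (line, column, message) for the first syntax error, or None."""
    try:
        ast.parse(source)  # parses the text; nothing is executed
    except SyntaxError as err:
        return err.lineno, err.offset, err.msg
    return None


# The same mistake as in Example 4-1: the def line has no trailing colon.
broken = "def plot_big_o(save_path)\n    return save_path\n"
fixed = "def plot_big_o(save_path):\n    return save_path\n"

print(find_syntax_error(broken))
print(find_syntax_error(fixed))
```

Like the linters above, this check stops at the first syntax error, because the rest of the file cannot be parsed reliably until that error is fixed.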
Type Checking

Type checking is another way of catching bugs before they cause errors in your code. The term "type" refers to categories of objects used by Python such as integers (int), strings, floats, and so on. A mismatch between the type of the input that a function is expecting and the type of the input that it receives will cause an error.

For example, this code sends a string to the function math.sqrt() when it is expecting a numeric type (such as an integer or a float):

    import math
    my_int = '200'
    print(math.sqrt(my_int))

This gives the following error:

    TypeError: must be real number, not str

Additionally, Python is a dynamically typed language, and this means that the type of a variable can change. This is as opposed to other languages like Java, where once you set the type of a variable it is fixed and cannot change. For example, this code runs without errors and changes the type of the variable:

    >>> my_variable = 10
    >>> my_variable = 'hello'
    >>> type(my_variable)
    <class 'str'>

my_variable starts out as an integer, but then becomes a string.

Types are an extremely common source of bugs. A function may receive a different type than it is expecting, or output a result of an incorrect type. This is so common that tools have been developed to spot these for you, so that you don't need to write a whole bunch of tests to check for them.

Type Annotations

Type annotations, also called type hints, were introduced in Python 3.5 to help reduce the number of bugs caused by incorrect types. They tell anyone reading the code what types a function expects as input and what type it returns. This helps to make the code much more readable, because the type annotation communicates the expected behavior of the function. Additionally, type annotations help with consistency and standardization in a larger codebase.

Type annotations are relatively new to Python, and they are still somewhat controversial.
Some people find that they help with readability, but other people find that they make your code harder to read. There's also extra work involved in adding the annotations and checking them. The developers of Python have stated (in PEP 484) that type annotations will remain optional in Python. If your team or company has a recommendation whether to use them or not, you should follow that recommendation.

Type annotations follow the format my_variable: type. For example, to define the types that a function accepts and returns, use the following format:

    def my_function(some_argument: type) -> return_type:

In this example, I've added type annotations to one of the functions from Chapter 2. This function expects a list as input, and returns a float, as you can now see in the function definition.

    from collections import Counter

    def mode_using_counter(some_list: list) -> float:
        c = Counter(some_list)
        return c.most_common(1)[0][0]

You can also create type annotations using types from outside the Python standard library. Here's an example of how you can write a type annotation using a numpy array:

    import numpy as np

    def array_operation(input_array: np.ndarray) -> np.ndarray:
        ...  # function body omitted

However, type annotations don't actually make any difference to the functioning of your code. For example, in this function one type annotation is incorrect: the input is annotated as a float when it should be a list. The code still runs correctly.

    from collections import Counter

    def mode_using_counter(some_list: float) -> float:
        c = Counter(some_list)
        return c.most_common(1)[0][0]

Type annotations only catch bugs if they are used with type checking tools, and the idea is that you run the type checker before running or deploying your code. The type checker analyzes your code and checks for mismatched types. You could also write tests to do this, but it's easier and faster with a type checker.
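As a small sketch of the point that annotations are not enforced at runtime, the snippet below inspects the annotations on the function from this section with typing.get_type_hints and then deliberately passes a tuple where a list is annotated. The sample values are made up for illustration.

```python
from collections import Counter
from typing import get_type_hints


def mode_using_counter(some_list: list) -> float:
    """Return the most common element of some_list."""
    c = Counter(some_list)
    return c.most_common(1)[0][0]


# The annotations are stored as metadata on the function object...
hints = get_type_hints(mode_using_counter)
print(hints)

# ...but Python does not check them when the function is called:
# a tuple is accepted here even though the annotation says list.
result = mode_using_counter((1.0, 2.0, 2.0))
print(result)
```

The call succeeds because nothing compares the argument against the annotation at runtime; that comparison is the job of a separate type checker.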
Some IDEs support type annotations in their autocomplete functions, but the most popular type checking tool is mypy, which I will describe how to use in the next section.

Type Checking with mypy

You can install mypy with the following command:

    $ pip install mypy

And then you can run it on any script with the following command:

    $ mypy my_script.py

Running mypy on ???, where the type annotations are correct, gives the following output:

    Success: no issues found in 1 source file

But running mypy on ??? with incorrect type annotations gives this output:

    ch07_mypy.py:4: error: No overload variant of "Counter" matches argument type "float"  [call-overlo
    ch07_mypy.py:4: note: Possible overload variants:
    ch07_mypy.py:4: note:     def [_T] Counter(self, None = ..., /) -> Counter[_T]
    ch07_mypy.py:4: note:     def [_T] Counter(self, None = ..., /, **kwargs: int) -> Counter[str]
    ch07_mypy.py:4: note:     def [_T] Counter(self, SupportsKeysAndGetItem[_T, int], /) -> Counter[_T
    ch07_mypy.py:4: note:     def [_T] Counter(self, Iterable[_T], /) -> Counter[_T]
    Found 1 error in 1 file (checked 1 source file)

Mypy has found the error in the type annotation, so this can be corrected. Then, anyone using this function will know what type it should accept and return.

Data Validation with Pydantic

Type annotations can also help you confirm (or validate) that your data is in the format that you are expecting. Pydantic is the most widely used Python data validation library, although you can also do similar things with NamedTuple from the typing module.

Pydantic uses type annotations to validate data, but it is not a static analysis tool like mypy. Instead, the validation happens when you run your code, so you don't need to run a separate tool.

You can install Pydantic with the following command:

    $ pip install pydantic

Pydantic uses the concept of data schemas to validate data.
First, you define a schema that describes the format of your data, then you can use that schema to check that new data is in the correct format. Here's an example of defining a schema for the UN sustainability data from Chapter 1. You'll see that you can define the type of your data and whether it is required (an error is raised if it is not present) or optional.

    from pydantic import BaseModel, StrictInt
    from typing import Optional
    from datetime import datetime

    class CountryData(BaseModel):
        country_name: str  # (1)
        population: StrictInt  # (2)
        literacy_rate_2020: Optional[float]  # (3)
        timestamp: Optional[datetime] = None  # (4)

(1) The country name is required, and must be a string or something that can be cast to a string without raising an error.
(2) The population is required, and must be an integer.
(3) The literacy rate is optional, and must be a float or something that can be cast to a float.
(4) The timestamp is optional, and must be a datetime object or something that can be cast to a datetime object, and the default value is None if no data is passed in.

This example reflects the behavior of Pydantic version 1. In Pydantic version 2, a field annotated as Optional[float] with no default value may be None but must still be provided.

Next, here's an example of data that will be validated as correct:

    sample_data_correct = {
        'country_name': 'India',
        'population': 1417242151,
        'literacy_rate_2020': 79.43,
        'timestamp': datetime.now()
    }

You can then use this data to create a new CountryData object as shown below. If the data is of the correct format no error is raised.

    india = CountryData(**sample_data_correct)

You can look up any of these pieces of data:

    >>> india.timestamp
    datetime.datetime(2023, 3, 7, 16, 3, 41, 423508)

However, if you pass in data that does not fit the requirements, like this:

    sample_data_incorrect = {
        'country_name': 'United States',
        'population': '336262544',
        'literacy_rate_2020': None,
        'timestamp': None
    }

    united_states = CountryData(**sample_data_incorrect)

In this case, the population is a string instead of an integer.
Pydantic raises the following error:

    ValidationError: 1 validation error for CountryData
    population
      value is not a valid integer (type=type_error.integer)

Pydantic is extremely useful for checking that the input data into a large project is what you're expecting. I'll give an example of how to use it in an API in a later chapter.

Key Takeaways

In this chapter, I described how code formatting, linting and type annotations can improve the quality of your code and help increase your productivity when writing code. Formatting according to a style guide makes your code more readable. Linting and type annotations identify potential errors before your code is deployed to a production service.

A key takeaway here is that you should comply with your team or company's standards, or introduce standards if they don't already exist. Standardized formatting helps prevent bugs, because it's easier to understand what the code is doing.

The most important thing you should remember about code formatting, linting and type annotating is: automate these tasks as much as possible. It's not a valuable use of your time to do them manually. It may take some time to set up at the start, but investing time in these tools will definitely pay off over the long term.
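One way to act on the advice to automate is a small script that runs the tools from this chapter in sequence. The sketch below is only one possible setup, not a recommended standard: the helper names build_commands and run_checks are hypothetical, the choice of flake8 as the linter is an assumption (swap in pylint or ruff if that is what your team uses), and it assumes the tools are installed in the current Python environment. The --check-only and --check flags make isort and Black report problems without rewriting your file.

```python
import subprocess
import sys


def build_commands(path):
    """Build one command per tool, run through the current interpreter."""
    return [
        [sys.executable, "-m", "isort", "--check-only", path],
        [sys.executable, "-m", "black", "--check", path],
        [sys.executable, "-m", "flake8", path],
        [sys.executable, "-m", "mypy", path],
    ]


def run_checks(path):
    """Run each tool in turn; return how many exited with a non-zero status."""
    failures = 0
    for command in build_commands(path):
        completed = subprocess.run(command, check=False)
        if completed.returncode != 0:
            failures += 1
    return failures


# Example use (not executed here): run_checks("plot_big_o.py")
```

A tool that is not installed also exits with a non-zero status, so it is counted as a failure here rather than being skipped silently.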
