Notes

Ch1-Documentation:
- Code should be easy to understand. To achieve that, you have to be
deliberate about the choices you make while coding.
Help People Understand:
- The goal is to minimize the time it takes another person to understand
your code.
- Characteristics of a good comment:
o Comments should have a high information-to-space ratio
o Use comments to make your code easier to understand, not
to increase its length
- While coding, step back from time to time, take a holistic look at
your code, and ask:
o Does it make sense?
o Can you explain your code in simple words? (It helps a lot.)
- A lot is going through your mind while coding. If you feel some of these
thoughts might help the reader’s understanding of the code, do add
them in comments.
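As an illustration of a high information-to-space ratio, the sketch below contrasts a comment that merely restates the code with one that records the reasoning behind it. The function trimmed_mean and the sensor scenario are hypothetical, chosen only for this example.

```r
# Low-value comment: restates what the code already says.
count <- 0
count <- count + 1  # add 1 to count

# Higher-value comment: explains why, which the code alone cannot show.
# `trimmed_mean` is a hypothetical helper used only for this illustration.
trimmed_mean <- function(x, trim = 0.1) {
  # Sensor readings occasionally spike, so drop the most extreme values
  # on each side before averaging to keep one bad reading from dominating.
  mean(x, trim = trim)
}
```

The second comment carries a thought that was in the author's head while coding, which is the kind of note worth writing down.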
Make It Look Good
- WHY: What is easier on the eyes is also easier to understand.
- HOW
o Stay consistent in your explanation
o Align parts of code into columns
o Break apart large blocks of code into logical paragraphs
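A small sketch of the last two points, assuming some made-up plot settings (all variable names here are hypothetical): related assignments are aligned into columns, and the code is split into short paragraphs, each introduced by a one-line comment.

```r
# Settings, aligned into columns so the values are easy to compare.
width_px  <- 800
height_px <- 600
dpi       <- 150

# Paragraph 1: convert the pixel size to inches.
width_in  <- width_px  / dpi
height_in <- height_px / dpi

# Paragraph 2: derive the printed area from the size in inches.
area_in2 <- width_in * height_in
```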
Break apart large blocks
- Break complex logic down into simpler pieces
- Benefits
o Helps the reader identify the main concepts in the code
o Makes the testing and debugging process easier
- Write as little new code as possible. But how?
o Eliminate non-essential features
o Rethink the requirements to find the easiest way possible
o Stay familiar with the standard libraries at hand. You can do
this by reading their APIs periodically.
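To make the point about standard libraries concrete, here is a minimal sketch that counts words twice: once with a hand-written loop and once with base R's table(). The word list is invented for the example.

```r
words <- c("apple", "pear", "apple", "fig", "apple")

# Hand-written version: more code to read, test, and debug.
counts_manual <- integer(0)
for (w in words) {
  if (is.na(counts_manual[w])) counts_manual[w] <- 0L
  counts_manual[w] <- counts_manual[w] + 1L
}

# Standard-library version: table() already does the counting.
counts_builtin <- table(words)
```

Both give the same counts, but the second version is one line that somebody else already tested.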
Ch2-Git and Version Control Best Practices
- Git is a version control system for tracking changes in computer files
and coordinating work on those files among multiple people.
Best Practices:
Commit
- A commit should be a wrapper for related changes
- Commit often, for every small change in your code. This helps your
teammates understand and integrate your changes in a timely manner and
avoids merge conflicts.
- Never commit half-done work. Instead, divide your work into smaller
chunks and commit each chunk as soon as it is completed and tested.
- Write descriptive messages for your commits so that your teammates
easily understand what you are doing. Your message should answer
the following two questions:
o What was the motivation for the change?
o How does it differ from the previous implementation?
- Always link your commit to a task (if you are using a task tracking tool
like Jira and you don’t have a task, make one).
- If you are not using a task tracking tool, define some categories for
commits. This lets the team easily filter relevant commits by category.
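One possible shape for a commit message that follows the points above is sketched below. The task ID PROJ-123, the "fix" category, and the change described are all made up for illustration and do not come from a real project.

```text
PROJ-123 fix: handle missing prices in the sales import

Why: the weekly import stopped whenever a row had an empty price.
What changed: rows with an empty price are now skipped and counted
in the log, instead of aborting the whole import as before.
```

The first line carries the task link and category, and the body answers the two questions: the motivation, and how the change differs from the previous implementation.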
Version Control is not a Backup System
- Commit semantically; don’t just cram files into a commit. Version control is not for backup.
CH6- Branding:
- What’s the purpose of your data science project?
o The ultimate goal should be to make sure that your project
makes a difference. This chapter helps with that by explaining
how to brand yourself and your code properly.
- 7 steps:
- Naming/branding:
o The name should be MBA (not that MBA):
▪ Memorable: you have a few options here: pronounceable (sharp sound and roll of the tongue, like Amazon), plain (redefining a simple word, like Apple), or produced (developing an entirely new word, like Google)
▪ Brandable: stay consistent in maintaining your brand identity, for example Ford trucks or Apple’s “Think Different”
▪ Available: don’t forget to google your selected name to check whether it is already owned by someone else.
- Document your product:
o In line with constitution number 4 of Cbt, you should not
hoard knowledge. That’s why you should document your code.
o Make your code standalone by making it easily understandable
for other coders and potential users.
- Identify power users:
o Identify the influential users who can spread the word about
your product.
After all this work and scrutiny you will be desperate to publish your work,
but wait. First you should,
- Time your launch:
o Inner assessment
▪ Is your team ready?
▪ How can you maximize your gain from the launch?
o Outer assessment
▪ What’s happening in your neighbourhood?
▪ Is the market for your type of product saturated at the moment?
- Cater to your audience:
o Identify your audience
o Stay in contact with your audience
- Demo it (just like Apple used to do)
- Track it: “Fareb e nazar hai sakoon o sabaat” (stillness and permanence are an illusion of the eye). There is no perfect technique.
o Keep working on your code
o Don’t hesitate to critique your own work
CH7-Reproducible Data Science

- Reproducibility: creating code that can be deployed and run anywhere
while giving the same results.
- Containerization: the packaging of software code with just the
operating system (OS) libraries and dependencies required to run it,
creating a single lightweight executable, called a container, that runs
consistently on any infrastructure.
- Docker: a computer program that performs operating-system-level
virtualization, also known as containerization.
- Why Docker:
o It avoids the overhead of maintaining VMs.
o It’s reproducible: a Docker image is the boilerplate or template for a container. It is created from a Dockerfile that describes how the image is built. Once built, the image can be used to start new containers.
o It’s isolated.
o It’s portable: you can share the Dockerfile with colleagues, check it in to git, and build and store the Docker image in a Docker repository like Docker Hub. Colleagues then share not only the same data science code but also the exact same infrastructure for a project.
Dockerized Workflow
- PRE-BUILT IMAGE: For instance, if you want to get started with
RStudio running R version 3.5.0 with the Tidyverse already installed,
then you can get up and running with the following command:
o docker run -d --name rstudio -p 7878:8787 rocker/tidyverse:3.5.0
o If the image is not already available on your local computer, Docker
will download it when you run the command.
o This is what the command means:
▪ docker: tells the shell that we are executing a docker command
▪ run: starts a container from the image
▪ -d: starts the container in the background (meaning you can keep using your terminal after the command) and does not display any logs
▪ --name rstudio: gives the container the name rstudio
▪ -p 7878:8787: maps the internal port 8787 (RStudio Server’s default port) to the external port 7878, meaning you can access it at localhost:7878. The external (host) port is written first.
▪ rocker/tidyverse:3.5.0: the image to be used is the tidyverse image from rocker, version 3.5.0
- CUSTOM-BUILT IMAGE: You can create a custom image if you want
to install more packages on top of the image used above.
o First describe the image in a Dockerfile that starts from that image and installs the extra packages.
o Next you need to build this Dockerfile into an image. You do this
by running this command in the folder where the Dockerfile
resides (the trailing dot tells Docker to use the current folder):
o docker build -t my_custom_image:v1 .
o This will create an image called my_custom_image with v1 as the
version tag. You can change both to whatever you want.
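The Dockerfile for the custom build is not shown in these notes. A minimal sketch of what it could look like follows, assuming the rocker/tidyverse:3.5.0 image from the pre-built example as the base; the package data.table is only a placeholder for whatever your project needs.

```dockerfile
# Dockerfile (sketch): start from the pre-built image used above.
FROM rocker/tidyverse:3.5.0

# Install extra R packages on top of it. `data.table` is a placeholder
# for the packages your project actually needs.
RUN R -e "install.packages('data.table', repos = 'https://cloud.r-project.org')"
```

Save this as a file named Dockerfile in the project folder, then run the docker build command from that same folder.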
Docker Compose: For Multi Container Applications

- FOR PRE-BUILT: The compose script for this case runs a “service” called
rstudio. This service starts up a container called rstudio based on the
rocker/tidyverse:3.5.0 image and maps the container’s port 8787 to port
7878 on the host, just like we did above.
- FOR CUSTOM BUILD: The compose script for this case also runs a “service”
called rstudio. This service starts up a container called rstudio, but
unlike the pre-built case, it builds the image from the Dockerfile in the
same folder (hence the . in the build setting). It then maps port 8787
to 7878 just like the other one.
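The compose scripts themselves are missing from these notes. The fragment below is a sketch of what a docker-compose.yml matching the two descriptions could look like. The file layout is an assumption based on the standard Compose format, not a copy of the original scripts.

```yaml
# docker-compose.yml (sketch)
version: "3"   # needed by older docker-compose releases; newer ones ignore it
services:
  rstudio:
    # Pre-built variant: use the published image.
    image: rocker/tidyverse:3.5.0
    # Custom-build variant: replace the `image:` line above with
    #   build: .
    # so the image is built from the Dockerfile in this folder.
    container_name: rstudio
    ports:
      - "7878:8787"   # host port 7878 -> container port 8787
```

With this file in place, docker-compose up -d starts the service in the background.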
CH8- Pipelines
- Think of different functions / operations / containers as blocks. To
execute the whole project, you need to connect all of them the right
way, just like a plumber fits pipes together in a specific manner to make
a city’s underground pipeline work. This process of connecting all the
blocks of a project is called pipelining.
Example:
https://streamsets.com/blog/getting-started-streamsets-data-collector-docker/
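The idea of connecting blocks can be sketched in a few lines of R. The three step functions and the run_pipeline helper below are hypothetical and exist only for this illustration; real pipelines usually connect larger blocks such as scripts or containers.

```r
# Three hypothetical blocks; each takes a data frame as input.
clean       <- function(df) df[!is.na(df$x), , drop = FALSE]
rescale     <- function(df) { df$x <- df$x / max(df$x); df }
summarise_x <- function(df) mean(df$x)

# The pipeline feeds the output of each block into the next one, in order.
run_pipeline <- function(input, steps) {
  Reduce(function(acc, step) step(acc), steps, input)
}

raw    <- data.frame(x = c(1, 2, NA, 4))
result <- run_pipeline(raw, list(clean, rescale, summarise_x))
```

Swapping the order of the steps, or dropping one, changes the result, which is why the blocks have to be connected the right way.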
CH9-Defensive Programming
- Defend against what? Against failing? No! It’s about
o ensuring that any failures are quick to surface,
o hard to miss, and
o easy to understand.
- Three things we have to protect our program from:
o Unanticipated user inputs
▪ Incorrect format of the input (TRUE instead of true)
▪ Incorrect unit of the input (meters instead of centimeters)
▪ Incorrect option (linear instead of hyperbolic or polynomial)
o Unanticipated results
▪ You expected a vector from a certain function which you used in your code, but it returned a list instead.
o Unreliable processes
▪ These processes are prone to random failures
▪ For example, imagine your code opens Teams and uploads an mp4 file to it. The failure can come from the internet connectivity.
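The "vector expected, list returned" case can be reproduced with base R. In this sketch, positive_part is a hypothetical helper. sapply() quietly returns a list when the results differ in length, while vapply() declares the expected shape and stops when it is violated.

```r
# Hypothetical helper: keeps only the positive values.
positive_part <- function(x) x[x > 0]

inputs <- list(c(1, -1), c(2, 3))

# sapply() simplifies only when it can; here the results have lengths
# 1 and 2, so it silently returns a list instead of a vector.
res <- sapply(inputs, positive_part)

# vapply() states the expected result (one number per input) and raises
# an error as soon as a result does not match.
checked <- tryCatch(
  vapply(inputs, positive_part, numeric(1)),
  error = function(e) "error"
)
```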
Principles of Defensive Programming
- If your program has to fail, make sure it fails,
o Conspicuously
▪ Errors are produced with stop().
▪ warning() is used when your program can mostly achieve what was required but the output might not be what was expected.
▪ message() is used to give the user status updates at different points of the program. The user can then infer from those messages whether something is going wrong.
▪ The good things about these conditions are: they are shown in a bright color; the user can call traceback() to identify where an error originated; and selected warnings and messages can be silenced using suppressWarnings() and suppressMessages().
o Fast
▪ It’s almost always better for a function to fail right away than to wait and keep trying, for two reasons: retrying takes time and computational resources, and the program will eventually terminate abnormally anyway, leaving behind messy partial outputs.
▪ The best way to surface errors fast is to use if statements coupled with the conditions mentioned above (stop(), warning(), message()).
▪ R offers built-in error handling for some common mistakes:
▪ When the arguments in a call do not match the parameters in the function definition, R reports: for too few arguments, argument "x" is missing, with no default; for too many arguments, unused argument (y = 3).
▪ match.arg() matches a value against a defined set of options, just like we did in vote() in pset3.
▪ One exception where retries are better: if your code does something that depends on the internet, retrying can be beneficial because internet connectivity is flaky.
o Informatively
▪ In cases where it’s inevitable for your program to fail, you need to communicate clearly to the user what went wrong.
▪ You can do that with if and stop(), but if you want to write less code and still produce detailed error messages, you can use the checkmate, assertive, assertr, and assertthat packages.
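The three principles can be sketched together in base R. Everything named here (to_cm, with_retry, the units) is hypothetical and only illustrates the pattern: an if check with stop() for a fast and informative failure, match.arg() for a fixed set of options, warning() and message() for conspicuous feedback, and a small retry loop for flaky steps.

```r
# Hypothetical function: converts a length to centimeters.
to_cm <- function(value, unit = c("cm", "m")) {
  # Fail fast and informatively when the input has the wrong type.
  if (!is.numeric(value)) {
    stop("`value` must be numeric, not ", class(value)[1], call. = FALSE)
  }
  # match.arg() raises an error if `unit` is not one of the allowed options.
  unit <- match.arg(unit)
  if (any(value < 0, na.rm = TRUE)) {
    # The result is still computed, but it may not be what was expected.
    warning("negative lengths found", call. = FALSE)
  }
  message("converting from ", unit, " to cm")
  if (unit == "m") value * 100 else value
}

# Hypothetical retry helper for flaky steps such as network calls.
with_retry <- function(f, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("attempt ", i, " failed: ", conditionMessage(result))
  }
  stop("giving up after ", max_tries, " attempts", call. = FALSE)
}
```

A call such as to_cm(1, "km") stops immediately with a message listing the allowed units, instead of producing a wrong number further down the line. Packages such as checkmate or assertthat offer shorter ways to write the same kind of checks.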
Balancing Defensiveness with Efficiency

- Defensive programming is an art. It requires,
a. Imagination, to think of all the failure cases
b. Judgement, to know how many tests are enough
- While deciding which tests to create, consider the following.
o Likely forms of bad input (which wrong formats can you expect?)
o If you were a confused user, what would you input?
o Upper-bound input: which inputs would be the most catastrophic?
o Upper-bound output
o Will users be calling this function directly, or can you control the
range of inputs by keeping this function internal to your
package?
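A handful of checks along these lines can be written with base R alone. The function vote_share and the helper fails below are hypothetical; the point is only to show one check per kind of input from the list above.

```r
# Hypothetical function under test: share of votes per option.
vote_share <- function(votes) {
  if (!is.character(votes)) stop("`votes` must be a character vector")
  if (length(votes) == 0) stop("`votes` is empty")
  table(votes) / length(votes)
}

# Helper: TRUE if evaluating `expr` signals an error.
fails <- function(expr) inherits(tryCatch(expr, error = function(e) e), "error")

checks <- c(
  wrong_type   = fails(vote_share(c(1, 2, 2))),        # likely bad input
  empty_input  = fails(vote_share(character(0))),      # confused user
  normal_input = !fails(vote_share(c("a", "b", "a")))  # expected use
)

# Stops with an error if any of the checks is FALSE.
stopifnot(all(checks))
```

How many such checks are enough remains a judgement call; a function that stays internal to a package usually needs fewer than one that users call directly.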
CH10-Ethics
- As the saying often attributed to Voltaire goes, “With great power comes great responsibility”
o Unfair Bias:
o What is in the model: A sound understanding of the model you
are using gives you more control over the model and its results
o Privacy by Design: Protect the privacy of the data you collect
and use, from the start of the project.
