Data Analysis With Python
PYTHON VARIABLES
VARIABLE TYPE:
How to get serious information about the number of the objects in parentheses
Statistics for Data Analysis and Data Science
Continuous vs Discrete
Continuous: A set of data is said to be continuous if the values belonging to the set can take on ANY
value within a finite or infinite interval. EX: The height of a horse (could be any value within the
range of horse heights). The time to complete a task (which could be measured to fractions of a second)
or the speed of a car on Route 3 (assuming legal speed limits).
Discrete: A set of data is said to be discrete if the values belonging to the set are distinct and
separate (unconnected values). EX: The number of people in your class (no fractional parts of a
person). The number of TV sets in a home (no fractional parts of a TV set). The number of questions
on a math test (no incomplete questions).
What is a Distribution
A probability distribution is a mathematical function that, stated in simple terms, can be thought of as
providing the probability of occurrence of the different possible outcomes of an experiment.
Discrete distributions tend to be presented as a bar chart, because the outcomes are separate,
countable values and each one gets its own bar.
A continuous distribution is drawn as a continuous line (a curve), because between any two values
there are infinitely many possible values.
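As a small illustration of a discrete distribution, here is a sketch that builds the probability of each possible sum when rolling two dice. The dice experiment is an assumed example, not one from these notes; the text "bars" stand in for the bar chart.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two six-sided dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

# Count how often each sum occurs and turn the counts into probabilities.
counts = Counter(sums)
distribution = {total: Fraction(n, len(sums)) for total, n in sorted(counts.items())}

for total, prob in distribution.items():
    # One '#' per outcome gives a rough text version of the bar chart.
    print(f"{total:>2}  {'#' * counts[total]}  {prob}")
```

The probabilities of all outcomes add up to 1, which is true of every probability distribution.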
Standard Deviation
According to Wikipedia, standard deviation is a measure of the amount of variation or dispersion of a set
of values. A low standard deviation indicates that the values tend to be close to the mean, and a
high standard deviation indicates that the values are spread out over a wider range.
Standard deviation is easier to interpret than variance because it is expressed in the same units as
the data, so we can add it to (or subtract it from) our mean. In the example above we add 65.92 +
4.08, which gives us the result 70.
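The relationship between variance and standard deviation can be checked with Python's built-in statistics module. This is a minimal sketch; the heights list is made-up sample data, not the data behind the 65.92 and 4.08 figures.

```python
import statistics

# Hypothetical sample: five heights in centimetres.
heights = [160, 165, 170, 175, 180]

mean = statistics.mean(heights)
variance = statistics.pvariance(heights)  # population variance, in cm squared
std_dev = statistics.pstdev(heights)      # population standard deviation, in cm

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)

# Because the standard deviation is in the same units as the data,
# it can be added to or subtracted from the mean.
print("one standard deviation around the mean:", mean - std_dev, "to", mean + std_dev)
```

The standard deviation is the square root of the variance, which is why it comes back in the original units.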
Quick tip to remember the difference between the standard deviation and variance
SKEWNESS
The best way to see the skewness of the data is to look at which direction the tail of the curve
extends, which is where the outliers are. In the left image the data is skewed to the left (negative),
and in the right image it is skewed to the right (positive). Adding numbers along the bottom of the
graphic helps us understand the skewness better.
For the median, we sort the values and take the one in the middle, so that an equal number of values
lies on each side. If the number of values is even, we take the two values in the middle and average
them (add them and divide by two) to get our median.
Another note: the median is not strongly affected by an outlier value, but the mean is.
The mode is the most frequent value; in the graphic it is the highest point.
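The median, mean and mode rules can be tried out with the statistics module. The numbers below are invented for illustration; the second list adds one outlier to show how it moves the mean but barely moves the median.

```python
import statistics

# Hypothetical data with an even number of values.
values = [2, 3, 3, 4, 5, 6]
# The same data plus one extreme value (an outlier on the right).
with_outlier = values + [100]

median_plain = statistics.median(values)          # average of the two middle values
median_outlier = statistics.median(with_outlier)  # middle value of seven
mean_plain = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)
mode_plain = statistics.mode(values)              # most frequent value

print("median:", median_plain, "->", median_outlier)
print("mean:  ", mean_plain, "->", mean_outlier)
print("mode:  ", mode_plain)
```

With the outlier, the mean is pulled far above the median, which is the typical sign of data skewed to the right.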
Combining the Strings
Str() Function
Title function
Myname = "Marcio "
LastName = "Santos"
print(Myname.title() + LastName.title())
Result: Marcio Santos
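The headings above (combining strings, str() and title()) can be sketched together as follows. The age value is a made-up number, added only to show why str() is needed when joining text with a number.

```python
first_name = "marcio "
last_name = "santos"
age = 30  # hypothetical value, used only for illustration

# title() capitalises the first letter of each word.
full_name = first_name.title() + last_name.title()

# str() converts the integer to text so it can be joined with "+".
message = full_name + " is " + str(age) + " years old"
print(message)
```

Without str(), adding a string and an integer with "+" raises a TypeError.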
LECTURE: LIST
What can I put in a list?
Store numbers: numbers_list = [1, 3.9, 101]
Store strings: string_list = ["John", "Mike", "Tony"]
Store mixed: mixed_list = [3, "apple", 2, "orange", "banana"]
Even …
A list of lists! Wow_list = [[1, 3.9, 101], ["John", "Mike", "Tony"], [3, "apple", 2, "orange", "banana"]]
One example
names = ["Marcio", "Joao", "Tony"]
["Hello "+item for item in names]
Result: ['Hello Marcio', 'Hello Joao', 'Hello Tony']
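A short sketch of how items in a list of lists are reached, and how the same "for item in" pattern can also filter. The variable names nested_list, second_name and fruits are illustrative and do not come from the lecture.

```python
nested_list = [[1, 3.9, 101], ["John", "Mike", "Tony"], [3, "apple", 2, "orange", "banana"]]

# The first index picks the inner list, the second index picks the item inside it.
second_name = nested_list[1][1]

# A list comprehension can also filter: keep only the strings from the mixed list.
fruits = [item for item in nested_list[2] if isinstance(item, str)]

# The same greeting pattern as above, applied to the inner list of names.
greetings = ["Hello " + item for item in nested_list[1]]

print(second_name)
print(fruits)
print(greetings)
```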
DICTIONARY IN PYTHON
SET
A set is essentially a particular type of collection, similar to a list, in which only unique items
are stored. Duplicates are dropped and the items have no guaranteed order.
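A minimal sketch of the uniqueness rule, using an invented list of names with repeats:

```python
names = ["John", "Mike", "John", "Tony", "Mike"]

# Converting the list to a set drops the duplicates.
unique_names = set(names)
print(len(names), "items in the list,", len(unique_names), "items in the set")

# Adding an item that is already present changes nothing.
unique_names.add("John")

# Sets have no guaranteed order, so sort them when the order matters.
sorted_names = sorted(unique_names)
print(sorted_names)
```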
Indentation
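Indentation is just a way to group lines of code into a block: every line in the block must be aligned
at the same distance from the left.
Whenever a line ends with ":" (after if, else, for and so on), the lines that follow it should be indented.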
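The sketch below shows three levels of indentation: the body of a function, the body of a for loop, and the bodies of if and else. The function name describe is made up for this example.

```python
def describe(numbers):
    # Everything indented under "def" belongs to the function.
    labels = []
    for n in numbers:
        # Everything indented under "for" runs once per item.
        if n % 2 == 0:
            labels.append(f"{n} is even")
        else:
            labels.append(f"{n} is odd")
    # Back at the function's indentation level: runs after the loop ends.
    return labels


print(describe([1, 2, 3]))
```

Moving the return line one level to the right would place it inside the loop, and the function would stop after the first item.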
Functions
IMPORT NUMPY
CREATE AN ARRAY
PRINT AN ARRAY
Another way
Two dimensions
Three dimensions
GENERATE STATISTICS
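The NumPy headings above (importing, creating and printing arrays, two and three dimensions, statistics) can be sketched together. The array contents are arbitrary values chosen for illustration, and the example assumes NumPy is installed.

```python
import numpy as np

# One-dimensional array created from a list.
one_d = np.array([1, 2, 3, 4])

# Two-dimensional array: a list of rows.
two_d = np.array([[1, 2, 3], [4, 5, 6]])

# Another way: build a three-dimensional array by reshaping a range of numbers.
three_d = np.arange(24).reshape(2, 3, 4)

print(one_d)
print(two_d.shape, three_d.shape)

# Basic statistics over the whole array and per column.
overall_mean = two_d.mean()
column_means = two_d.mean(axis=0)
print(overall_mean, two_d.std(), two_d.min(), two_d.max())
print(column_means)
```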
SERIES CREATION
SERIES OPERATION
INDEXING
Importing data from Excel
To work with data from Excel, the Excel file must first be loaded into pandas.
Example: data_excel = pd.read_excel('Business Analitics.xlsx')
data_excel
Array Transformation
Review DataFrame ix() (note: ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc[] or iloc[] instead)
11/10/2020
Backup data
12/10/2020
Visualization
%matplotlib inline = a Jupyter magic command that displays the plots inside the notebook. Matplotlib is the library that pandas uses for plots.
The files must already be saved on the computer so that we can import them into Jupyter.
13/10/2020
To add a histogram
df.plot.hist()
The problem with this plot is that overlapping bars hide the bars behind them.
df.plot.hist(stacked=True)
df.plot.box()
A scatter plot is used to show the relationship between two variables.
df.plot.scatter(x='male', y='female')
Another option is to change the size of the points on the scatter plot
df['score'] = df.male*0.3 + df.female*0.7
df.head()
Then
df.plot.scatter(x='male', y='female', s=df.score)
Drawbacks of the scatter plot
Creating heatmaps
df = pd.DataFrame(np.random.randint(100, size=(10, 4)), columns=['a', 'b', 'c', 'd'])
df.head()
df.iloc[0].plot.pie()
df.iloc[0].plot.pie(figsize=(5, 5))  # equal width and height make the pie plot more circular
Area plot
If the condition is changing over time, this is the plot to use. It is easy to make and to read, and it
provides loads of information.
df.plot.area()
To pick just one area
df.c.plot.area()
15/10/2020
Regression analysis