Python for Data Science 3-10 Getting Your Hands Dirty with Data

* A visual representation of how this image is stored as a NumPy array is as follows. The image can be read and displayed with scikit-image:

import skimage.io
import skimage.viewer

# read image
image = skimage.io.imread(fname="table.jpg")

# display image
viewer = skimage.viewer.ImageViewer(image)
viewer.show()

Managing Data from Relational Databases

* Relational databases accomplish both the manipulation and the retrieval of data. SQL performs all sorts of management tasks in a relational database and retrieves data as needed.
* A relational database consists of one or more tables of information. The rows in a table are called records and the columns are called fields or attributes. A database that contains two or more related tables is called a relational database.
* When working on a data science project, you may want to connect Python scripts with databases. A library known as SQLAlchemy bridges the gap between SQL and Python. The first thing to do is to create an engine, which is an object that is used to manage the connection to the database.

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine('postgresql://login:password@localhost:5432/flight')

* The general format to create an engine is:

create_engine("postgresql://login:password@localhost:5432/name_database")

* The sqlalchemy library provides support for SQL databases such as SQLite, MySQL, PostgreSQL and SQL Server.

TECHNICAL PUBLICATIONS® - An up thrust for knowledge

Interacting with Data from NoSQL Databases

* The Not only SQL (NoSQL) databases are used in large data storage scenarios in which the relational model can become overly complex.
* NoSQL databases provide features for the retrieval and storage of data in a much different way than their relational database counterparts.
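The engine pattern above can be exercised without a running PostgreSQL server. The following is a minimal sketch using an in-memory SQLite database (the `flight` table and its columns here are hypothetical, chosen only to echo the connection string above) that shows how an engine feeds query results into pandas:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite database, so no database server is required.
engine = create_engine("sqlite:///:memory:")

# Create and populate a small hypothetical table.
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE flight (id INTEGER, origin TEXT)"))
    conn.execute(text("INSERT INTO flight VALUES (1, 'PNQ'), (2, 'BOM')"))

# Read the query result straight into a DataFrame.
df = pd.read_sql("SELECT * FROM flight ORDER BY id", engine)
print(df)
```

Swapping the SQLite URL for the PostgreSQL URL shown above is the only change needed to target a real server, which is the point of the engine abstraction.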
* The first thing that we need to do in order to establish a connection is import the MongoClient class. We'll use this to communicate with the running database instance. Use the following code to do so:

from pymongo import MongoClient
client = MongoClient()

Conditioning Your Data : Juggling between NumPy and Pandas

1. NumPy

* NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.
* NumPy is the fundamental package needed for scientific computing with Python. It contains :
a) A powerful N-dimensional array object
b) Basic linear algebra functions
c) Basic Fourier transforms
d) Sophisticated random number capabilities
e) Tools for integrating Fortran code
f) Tools for integrating C/C++ code
* NumPy is an extension package to Python for array programming. It provides "closer to the hardware" optimization, which in Python means C implementation.

2. Pandas

* Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package and its key data structure is called the DataFrame.
* DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
* Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
* Pandas is the library for data manipulation and analysis. Usually, it is the starting point for your data science tasks. It allows you to read/write data from/to multiple sources, process missing data, align your data, reshape it, merge and join it with other data, search data, group it and slice it.
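The interplay between the two libraries can be sketched in a few lines: a NumPy array supplies the fast numeric storage, and wrapping it in a DataFrame adds labelled axes on top (the column names below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Raw numeric data as a 2-D NumPy array.
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Wrap it in a DataFrame to get labelled rows and columns.
df = pd.DataFrame(arr, columns=["x", "y"])

print(df["x"].mean())   # column-wise statistics via pandas
print(df.to_numpy())    # and back to a plain NumPy array
```

Because the DataFrame reuses the NumPy buffer where it can, moving between the two representations is cheap, which is what makes the "juggling" described above practical.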
Figuring out what's in Your Data

* Duplicate data creates problems for a data science project. If the database is large, processing duplicate data wastes time.
* Finding duplicates is important because it saves time and space and avoids false results. Duplicate data can be removed easily and efficiently using the drop_duplicates() function in pandas.
* Create a DataFrame with duplicate data:

import pandas as pd
raw_data = {'first_name': ['rupali', 'rupali', 'rakshita', 'sangeeta', 'mahesh', 'vilas'],
            'last_name': ['dhotre', 'dhotre', 'aut', 'aut', 'jadhav', 'bagad'],
            'age': [12, 12, 1111111, 36, 24, 73],
            'TestScore1': [4, 4, 4, 31, 2, 3],
            'TestScore2': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'TestScore1', 'TestScore2'])

* Drop duplicates:

df.drop_duplicates()

* Drop duplicates in the first_name column, but take the last observation in the duplicated set:

df.drop_duplicates(['first_name'], keep='last')

Creating a Data Map and Data Plan

* An overview of the dataset is given by a data map. A data map is used for finding potential problems in data, such as redundant variables, possible errors, missing values and variable transformations.
* Try creating a Python script that converts a Python dictionary into a Pandas DataFrame, then prints the DataFrame to screen:

import pandas as pd
scottish_hills = {'Ben Nevis': (1345, 56.79685, -5.003508),
                  'Ben Macdui': (1309, 57.070453, -3.668262),
                  'Braeriach': (1296, 57.078628, -3.728024),
                  'Cairn Toul': (1291, 57.054611, -3.71042),
                  'Sgurr an Lochain Uaine': (1258, 57.057999, -3.725416)}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)

Manipulating and Creating Categorical Variables

* A categorical variable is one that has a specific value from a limited selection of values. The number of values is usually fixed.
* Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset contains information related to users, you will typically find features like country, gender, age group, etc. Alternatively, if the data you are working with is related to products, you will find features like product type, manufacturer, seller and so on.
* Method for creating a categorical variable and then using it to check whether some data falls within the specified limits:

import pandas as pd
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
cycle_data = pd.Series(
    pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
                   categories=cycle_colors, ordered=False))
find_entries = pd.isnull(cycle_data)

print(cycle_colors)
print(cycle_data)
print(find_entries[find_entries == True])

* Here cycle_colors is a categorical variable. It contains the values Blue, Red and Green as colors. The entries Yellow and Purple fall outside these categories, so they become NaN in cycle_data and are flagged by find_entries.

Renaming Levels and Combining Levels

* Data frame variable names are typically used many times when wrangling data. Good names for these variables make it easier to write and read wrangling programs.
* Categorical data has a categories and an ordered property, which list the possible values and whether the ordering matters or not.
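The ordered property mentioned above changes what a categorical can do: with ordered=True, comparisons and order-aware reductions become meaningful. A small sketch (the size labels are invented for illustration):

```python
import pandas as pd

# An ordered categorical: small < medium < large (hypothetical labels).
sizes = pd.Series(
    pd.Categorical(["small", "large", "medium", "small"],
                   categories=["small", "medium", "large"],
                   ordered=True))

print(sizes.min(), sizes.max())  # order-aware min and max
print(sizes > "small")           # elementwise comparison uses the declared order
```

With ordered=False (as in the cycle_colors example), the same comparisons would raise a TypeError, since pandas has no order to compare against.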
* Renaming categories is done by assigning new values to the Series.cat.categories property or by using the Categorical.rename_categories() method:

In [41]: s = pd.Series(['a', 'b', 'c', 'a'], dtype="category")

In [42]: s
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [43]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [44]: s
Out[44]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

In [45]: s.cat.rename_categories([1, 2, 3])
Out[45]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Dealing with Dates and Times Values

* Dates are often provided in different formats and must be converted into single-format DateTime objects before analysis.
* Python provides two methods of formatting date and time:
1. str() - it turns a datetime value into a string without any formatting.
2. strftime() function - it defines how the user wants the datetime value to appear after conversion.

1. Using pandas.to_datetime() with a date:

import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# output in yyyy-mm-dd format
print(pd.to_datetime(date))

2. Using pandas.to_datetime() with a date and time:

import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS)
date = ['21.07.2020 11:31:01 AM']
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date))

* We can convert a string to datetime using the strptime() function. This function is available in the datetime and time modules to parse a string to datetime and time objects respectively.
* Python strptime() is a class method in the datetime class.
Its syntax is:

datetime.strptime(date_string, format)

* Both the arguments are mandatory and should be strings. For example:

import datetime

format = "%a %b %d %H:%M:%S %Y"
today = datetime.datetime.today()
s = today.strftime(format)                  # datetime -> formatted string
print(s)
d = datetime.datetime.strptime(s, format)   # string -> datetime
print(d)

* Example : Using CountVectorizer and TfidfTransformer in a scikit-learn Pipeline:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> corpus = ['this is the first document',
...           'this document is the second document',
...           'and this is the third one',
...           'is this the first document']
>>> vocabulary = ['this', 'document', 'first', 'is', 'second', 'the', 'and', 'one']
>>> pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
...                  ('tfid', TfidfTransformer())]).fit(corpus)
>>> pipe['count'].transform(corpus).toarray()
array([[1, 1, 1, 1, 0, 1, 0, 0],
       [1, 2, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 0, 0]])
>>> pipe['tfid'].idf_
array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.        , 1.91629073, 1.91629073])
>>> pipe.transform(corpus).shape
(4, 8)

Understanding the Adjacency Matrix

* An adjacency matrix represents the connections between nodes of a graph.
* The row and column indices represent the vertices: matrix[i][j] = 1 means that there is an edge from vertex i to vertex j, and matrix[i][j] = 0 denotes that there is no edge between i and j.
* The advantage of the adjacency matrix is that it is simple, and for small graphs it is easy to see which nodes are connected to other nodes.
* Start Python and import NetworkX:

import networkx as nx

* Different classes exist for directed and undirected networks. Let's create a basic undirected Graph:

g = nx.Graph()  # empty graph

* The graph g can be grown in several ways. NetworkX provides many generator functions and facilities to read and write graphs in many formats.
Example :

import networkx as nx

# Create a networkx graph object
my_graph = nx.Graph()

# Add edges to the graph object
# Each tuple represents an edge between two nodes
my_graph.add_edges_from([(1, 2), (1, 3), (3, 4), (1, 5), (3, 5), (4, 2), (2, 3), (3, 0)])

# Draw the resulting graph
nx.draw(my_graph, with_labels=True, font_weight='bold')
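To tie the example back to the adjacency-matrix idea above, the matrix itself can be extracted with NetworkX's to_numpy_array. A sketch using the same edges (the node order is fixed explicitly so the rows and columns are predictable):

```python
import networkx as nx

# Same edges as in the example above.
g = nx.Graph()
g.add_edges_from([(1, 2), (1, 3), (3, 4), (1, 5), (3, 5), (4, 2), (2, 3), (3, 0)])

# Adjacency matrix as a NumPy array; nodelist pins row/column order.
a = nx.to_numpy_array(g, nodelist=[0, 1, 2, 3, 4, 5])
print(a)

# For an undirected graph the matrix is symmetric.
print((a == a.T).all())
```

Reading the matrix row by row reproduces the edge list: for example, row 0 has a single 1 in column 3, matching the lone edge (3, 0).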
