Cheat Sheet: Learn Python for Data Science Interactively at www.DataCamp.com

1. The document covers data manipulation and analysis techniques using the Pandas library in Python, including selecting, filtering, querying, reshaping, merging, and indexing data.
2. It provides examples of how to pivot/reshape data, select columns based on conditions, query a DataFrame, merge/join datasets, and set/reset the index.
3. These techniques support exploring, cleaning, and preparing data for further analysis using Pandas functions such as loc, query, merge, pivot, and reindex.


PYTHON FOR DATA SCIENCE: PANDAS

Selecting                                Also see NumPy Arrays

>>> df3.loc[:, (df3 > 1).any()]      # Select cols with any vals > 1
>>> df3.loc[:, (df3 > 1).all()]      # Select cols with all vals > 1
>>> df3.loc[:, df3.isnull().any()]   # Select cols with NaN
>>> df3.loc[:, df3.notnull().all()]  # Select cols without NaN

Indexing With isin
>>> df[df.Country.isin(df2.Type)]    # Find same elements
>>> df3.filter(items=['a', 'b'])     # Filter on values
>>> df.select(lambda x: not x % 5)   # Select specific elements
                                     # (select() is deprecated; use df.loc)

Where
>>> s.where(s > 0)                   # Subset the data
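The column-selection idioms above can be sketched on a small stand-in frame (the `df3` values here are illustrative, not the sheet's actual data). Note that a NaN comparison evaluates to False, so a column with NaN never passes the `.all()` test:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet's df3
df3 = pd.DataFrame({"a": [1.5, 2.0, 1.5],
                    "b": [0.1, 0.2, 0.3],
                    "c": [1.0, np.nan, 3.0]})

any_gt1 = df3.loc[:, (df3 > 1).any()]      # cols with at least one value > 1
all_gt1 = df3.loc[:, (df3 > 1).all()]      # cols where every value > 1
has_nan = df3.loc[:, df3.isnull().any()]   # cols containing NaN
no_nan  = df3.loc[:, df3.notnull().all()]  # cols with no NaN

print(list(any_gt1.columns))  # ['a', 'c']
print(list(all_gt1.columns))  # ['a']  (c is excluded: NaN > 1 is False)
```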
Query

>>> df6.query('second > first')      # Query DataFrame

Setting/Resetting Index

>>> df.set_index('Country')          # Set the index
>>> df4 = df.reset_index()           # Reset the index
>>> df = df.rename(index=str,        # Rename DataFrame
...                columns={"Country": "cntry",
...                         "Capital": "cptl",
...                         "Population": "ppltn"})

Reindexing

>>> s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])

Forward Filling                          Backward Filling
>>> df.reindex(range(4),                 >>> s3 = s.reindex(range(5),
...            method='ffill')           ...                method='bfill')

   Country   Capital     Population         0  3
0  Belgium   Brussels    11190846           1  3
1  India     New Delhi   1303171035         2  3
2  Brazil    Brasília    207847528          3  3
3  Brazil    Brasília    207847528          4  3

Reshaping Data

Pivot
>>> df3 = df2.pivot(index='Date',    # Spread rows into columns
...                 columns='Type',
...                 values='Value')

   Date        Type  Value           Date        a       b       c
0  2016-03-01  a     11.432          2016-03-01  11.432  NaN     20.784
1  2016-03-02  b     13.031          2016-03-02  1.303   13.031  NaN
2  2016-03-01  c     20.784          2016-03-03  99.906  NaN     20.784
3  2016-03-03  a     99.906
4  2016-03-02  a     1.303
5  2016-03-03  c     20.784

Pivot Table
>>> df4 = pd.pivot_table(df2,        # Spread rows into columns
...                      values='Value',
...                      index='Date',
...                      columns='Type')

Combining Data

data1                    data2
   X1  X2                   X1  X3
   a   11.432               a   20.784
   b   1.303                b   NaN
   c   99.906               d   20.784

>>> pd.merge(data1, data2,              X1  X2      X3
...          how='left', on='X1')       a   11.432  20.784
                                        b   1.303   NaN
                                        c   99.906  NaN

>>> pd.merge(data1, data2,              X1  X2      X3
...          how='right', on='X1')      a   11.432  20.784
                                        b   1.303   NaN
                                        d   NaN     20.784

>>> pd.merge(data1, data2,              X1  X2      X3
...          how='inner', on='X1')      a   11.432  20.784
                                        b   1.303   NaN

>>> pd.merge(data1, data2,              X1  X2      X3
...          how='outer', on='X1')      a   11.432  20.784
                                        b   1.303   NaN
                                        c   99.906  NaN
                                        d   NaN     20.784

Join
>>> data1.join(data2, how='right')
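The four merge variants can be reproduced end to end using the sheet's own data1/data2 tables (a minimal sketch; only the values shown above are assumed):

```python
import numpy as np
import pandas as pd

# The sheet's sample tables
data1 = pd.DataFrame({"X1": ["a", "b", "c"],
                      "X2": [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({"X1": ["a", "b", "d"],
                      "X3": [20.784, np.nan, 20.784]})

left  = pd.merge(data1, data2, how="left",  on="X1")  # keep all data1 keys
right = pd.merge(data1, data2, how="right", on="X1")  # keep all data2 keys
inner = pd.merge(data1, data2, how="inner", on="X1")  # keys present in both
outer = pd.merge(data1, data2, how="outer", on="X1")  # keys present in either

print(len(left), len(right), len(inner), len(outer))  # 3 3 2 4
```

The row counts match the result tables above: `inner` keeps only the shared keys a and b, while `outer` adds a NaN-padded row for each key missing on one side.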
Stack / Unstack

>>> stacked = df5.stack()            # Pivot a level of column labels
>>> stacked.unstack()                # Pivot a level of index labels

Unstacked                                Stacked
        0         1                      1  5  0  0.233482
1  5  0.233482  0.390959                       1  0.390959
2  4  0.184713  0.237102                 2  4  0  0.184713
3  3  0.433522  0.429401                       1  0.237102
                                         3  3  0  0.433522
                                               1  0.429401

MultiIndexing

>>> arrays = [np.array([1, 2, 3]),
...           np.array([5, 4, 3])]
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
>>> tuples = list(zip(*arrays))
>>> index = pd.MultiIndex.from_tuples(tuples,
...                                   names=['first', 'second'])
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
>>> df2.set_index(["Date", "Type"])

Vertical
>>> s.append(s2)                     # removed in pandas 2.0;
                                     # use pd.concat([s, s2])

Horizontal/Vertical
>>> pd.concat([s, s2], axis=1, keys=['One', 'Two'])
>>> pd.concat([data1, data2], axis=1, join='inner')
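The MultiIndex setup above can be run as one piece; since `np.random.rand` makes the values differ per run, the checks below look only at shapes and at the `query` on the named index levels:

```python
import numpy as np
import pandas as pd

# Reproducing the sheet's MultiIndex construction
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df6 = pd.DataFrame(np.random.rand(3, 2), index=index)

stacked = df6.stack()          # columns fold into an inner index level
roundtrip = stacked.unstack()  # and unfold back to the original shape

hits = df6.query("second > first")  # matches rows (1, 5) and (2, 4)
```

`query` can reference the index levels by their names (`first`, `second`), which is why naming the levels in `from_tuples` pays off.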
Dates

>>> df2['Date'] = pd.to_datetime(df2['Date'])
>>> df2['Date'] = pd.date_range('2000-1-1',
...                             periods=6,
...                             freq='M')
>>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
>>> index = pd.DatetimeIndex(dates)
>>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')

Duplicate Data

>>> s3.unique()                              # Return unique values
>>> df2.duplicated('Type')                   # Check duplicates
>>> df2.drop_duplicates('Type', keep='last') # Drop duplicates
>>> df.index.duplicated()                    # Check index duplicates

Melt

>>> pd.melt(df2,                     # Gather columns into rows
...         id_vars=["Date"],
...         value_vars=["Type", "Value"],
...         value_name="Observations")

   Date        Type  Value            Date        Variable  Observations
0  2016-03-01  a     11.432       0   2016-03-01  Type      a
1  2016-03-02  b     13.031       1   2016-03-02  Type      b
2  2016-03-01  c     20.784       2   2016-03-01  Type      c
3  2016-03-03  a     99.906       3   2016-03-03  Type      a
4  2016-03-02  a     1.303        4   2016-03-02  Type      a
5  2016-03-03  c     20.784       5   2016-03-03  Type      c
                                  6   2016-03-01  Value     11.432
                                  7   2016-03-02  Value     13.031
                                  8   2016-03-01  Value     20.784
                                  9   2016-03-03  Value     99.906
                                  10  2016-03-02  Value     1.303
                                  11  2016-03-03  Value     20.784

Grouping Data

Aggregation
>>> df2.groupby(by=['Date', 'Type']).mean()
>>> df4.groupby(level=0).sum()
>>> df4.groupby(level=0).agg({'a': lambda x: sum(x)/len(x),
...                           'b': np.sum})

Transformation
>>> customSum = lambda x: (x + x % 2)
>>> df4.groupby(level=0).transform(customSum)

Visualization

>>> import matplotlib.pyplot as plt
>>> s.plot()
>>> plt.show()
>>> df2.plot()
>>> plt.show()
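The difference between aggregation and transformation is worth seeing side by side: aggregation collapses each group to one row, while transformation returns a result aligned to the original rows. A minimal sketch on the sheet's df2:

```python
import pandas as pd

# The sheet's df2
df2 = pd.DataFrame({
    "Date": ["2016-03-01", "2016-03-02", "2016-03-01",
             "2016-03-03", "2016-03-02", "2016-03-03"],
    "Type": ["a", "b", "c", "a", "a", "c"],
    "Value": [11.432, 13.031, 20.784, 99.906, 1.303, 20.784],
})

# Aggregation: one mean per (Date, Type) pair
agg = df2.groupby(by=["Date", "Type"]).mean()

# Transformation: each row gets its group's total, same length as df2
totals = df2.groupby("Type")["Value"].transform("sum")
```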
Missing Data

>>> df.dropna()                  # Drop NaN values
>>> df3.fillna(df3.mean())       # Fill NaN values with a predetermined value
>>> df2.replace("a", "f")        # Replace values with others

Iteration

>>> df.items()                   # (Column-index, Series) pairs
                                 # (iteritems() was removed in pandas 2.0)
>>> df.iterrows()                # (Row-index, Series) pairs
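A short sketch tying the missing-data and iteration idioms together (the `df3` frame here is a hypothetical example, not the sheet's data):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [4.0, 5.0, 6.0]})

dropped = df3.dropna()           # rows containing any NaN are removed
filled = df3.fillna(df3.mean())  # NaN replaced by each column's mean

# items() yields (column name, Series) pairs
ncols = sum(1 for _name, _col in df3.items())

print(filled.loc[1, "a"])  # 2.0, the mean of 1.0 and 3.0
```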
