diff --git a/22__pandas-how-to-filter-results-of-value_counts.patch b/22__pandas-how-to-filter-results-of-value_counts.patch new file mode 100644 index 0000000..e69de29 diff --git a/README.md b/README.md new file mode 100644 index 0000000..32615a6 --- /dev/null +++ b/README.md @@ -0,0 +1,101 @@ +# python +Jupyter notebooks and datasets for the pandas/python/data science video series. + +# Contribution + +Feel free to contribute or suggest new ideas. To get in touch, send an [email](mailto:grouprivl@gmail.com?subject=[GitHub]%20Source%20Python). + +Helpful guides to contributing on GitHub: +* [Contributing to projects](https://docs.github.com/en/get-started/quickstart/contributing-to-projects) +* [Step-by-step guide to contributing on GitHub](https://www.dataschool.io/how-to-contribute-on-github/) + +# Who is this repo for? + +For people who are interested in data science, data analysis and finding interesting insights in data. This repository accompanies the sites: +* [DataScientYst.com - Data Science Tutorials, Exercises, Guides, Videos with Python and Pandas](https://datascientyst.com/) +* [SoftHints.com - Python, Pandas, Linux, SQL Tutorials and Guides](https://softhints.com/) + +where you can find more interesting articles. + +A new website dedicated to Pandas and Data Science has been started: https://datascientyst.com/. It is better organized and covers topics in many areas. + + +The YouTube channel: + +* [SoftHints Youtube](https://www.youtube.com/@softhints/) +* [Popular Videos](https://www.youtube.com/@softhints/videos) + +# Latest Videos + +## Pandas + +0. [Pandas Tutorial : How to split columns of dataframe](https://www.youtube.com/watch?v=cCoGsFVPVh0&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +1. [Pandas Tutorial : How to split dataframe by string or date](https://www.youtube.com/watch?v=7sgDvC4k6Xg&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +2. 
[Easily extract tables from websites with pandas and python](https://www.youtube.com/watch?v=OXA_ZD1gR6A&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +3. [Easily extract information from excel with Python and Pandas](https://www.youtube.com/watch?v=hJMH_1o8eU0&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +4. [Extract tabular data from PDF with Python - Tabula, Camelot, PyPDF2](https://www.youtube.com/watch?v=702lkQbZx50&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +5. [Pandas is column part of another column in the same row of dataframe](https://www.youtube.com/watch?v=duOHHDqI40c&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +6. [Load multiple CSV files into a single Dataframe](https://www.youtube.com/watch?v=30ndwJm1I5c&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +7. [Analyze top youtube channels 2019 with pandas - PewDiePie I](https://www.youtube.com/watch?v=mG9OnH9R5yM&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +8. [dataframe column transformations ( str, int, category, concat)](https://www.youtube.com/watch?v=5pbRivDYzko&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +9. [Pandas DataFrame generate n-level hierarchical JSON](https://www.youtube.com/watch?v=lCcE-0bykRU&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +10. [Pandas How add new column existing DataFrame](https://www.youtube.com/watch?v=UvCO5gKQqtE&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +11. [Python Pandas find and drop duplicate data](https://www.youtube.com/watch?v=4ixLp8aFomw&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +12. [Map the headers to a column with pandas?](https://www.youtube.com/watch?v=3g6KG_8zq0E&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +13. [Pandas count values in a column of type list](https://www.youtube.com/watch?v=lx7KFd6BPcg&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +14. [How to Optimize and Speed Up Pandas](https://www.youtube.com/watch?v=nW5ltiwV-6Y&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +15. 
[Pandas count and percentage by value for a column](https://www.youtube.com/watch?v=P5pxJkv71BU&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) +16. [Pandas use a list of values to select rows from a column](https://www.youtube.com/watch?v=jlSbo5wmTPQ&list=PLeicpQTG639FTJ-daMp7YWmQLBH3zumXv) + + +## python + +0. [python string split by separator](https://www.youtube.com/watch?v=iBsg75W2Vig&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +1. [python random number generation examples](https://www.youtube.com/watch?v=WDTnZgSreL4&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +2. [bilingual programming education in java and python](https://www.youtube.com/watch?v=eEHBjP06WSI&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +3. [biggest programmer salaries 2018](https://www.youtube.com/watch?v=X2bUUkWC7dE&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +4. [python extract text from image or pdf](https://www.youtube.com/watch?v=PK-GvWWQ03g&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +5. [Python read validate and import CSV JSON file to MySQL](https://www.youtube.com/watch?v=WbW0rHCX2UU&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +6. [python regex match date](https://www.youtube.com/watch?v=o8Je7hPgsdU&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +7. [python regex cheat sheet with examples](https://www.youtube.com/watch?v=o_CSmob64uU&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +8. [python string methods tutorial](https://www.youtube.com/watch?v=7yuPVq9DtV0&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +9. [python shuffle list](https://www.youtube.com/watch?v=WFRBxz6AeZI&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +10. [Easy install of Python and PyCharm on Windows](https://www.youtube.com/watch?v=cDOlBRzHRI0&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +11. [learn python for beginners complete tutorial 2018](https://www.youtube.com/watch?v=hnc3bGtYQsQ&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +12. [think python chaper 2](https://www.youtube.com/watch?v=A6EIl677ntQ&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +13. 
[Python/Java bad and good code comments examples](https://www.youtube.com/watch?v=SRCToEkq7to&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +14. [intellij pycharm surround string quote](https://www.youtube.com/watch?v=AgRHEGB8Urs&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +15. [Top Five Most Annoying Programming Mistakes For Beginners with Python](https://www.youtube.com/watch?v=JToPoYip-C4&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +16. [No Python Interpreter Configured For The Module - PyCharm/IntelliJ](https://www.youtube.com/watch?v=mkKDI6y2kyE&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +17. [python split string into list examples](https://www.youtube.com/watch?v=T8EfomTlcfA&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +18. [How to migrate/update virtualenv from Python 3.5 to 3.6](https://www.youtube.com/watch?v=cFTB5EJUxzw&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +19. [Python String Remove Last n Characters](https://www.youtube.com/watch?v=hZHfdOKFlAw&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +20. [Python Pandas 7 examples of filters and lambda apply](https://www.youtube.com/watch?v=7nYkJctgSSA&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +21. [The simplest way to run python headless test with Chrome on Ubuntu](https://www.youtube.com/watch?v=BdppFIT_lIs&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +22. [Python 3 Simple Examples get current folder and go to parent](https://www.youtube.com/watch?v=tQ_9a6UhUQs&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +23. [python join/merge list two and more lists](https://www.youtube.com/watch?v=-zcJ4uB7XUo&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +24. [Easy way to convert dictionary to SQL insert with Python](https://www.youtube.com/watch?v=hUXGQwTSfMs&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +25. [Python 3 detect and prevent TypeError-s](https://www.youtube.com/watch?v=DJd0JYaVkqA&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +26. 
[The right way to declare multiple variables in Python](https://www.youtube.com/watch?v=8OoLg39nNlo&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +27. [Python uninstall a module installed with pip install and virtual envirornment](https://www.youtube.com/watch?v=03ahRfkfwME&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +28. [python performance profiling in pycharm](https://www.youtube.com/watch?v=EZ-im7m8630&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +29. [Python Cumulative Sum per Group with Pandas](https://www.youtube.com/watch?v=1tCbvYv_ibw&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +30. [PyCharm - Breakpoints, Favorites, TODOs simple examples](https://www.youtube.com/watch?v=_fNZLrz97kg&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +31. [Python 3 simple ways to list files and folders](https://www.youtube.com/watch?v=oJdubyyJNIQ&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +32. [Python 3 elegant way to find most/less common element in a list](https://www.youtube.com/watch?v=P4LonC3puS4&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +33. [clock angle problem final](https://www.youtube.com/watch?v=eIRhXharV7k&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +34. [Python 3 List Comprehension Tutorial for beginners](https://www.youtube.com/watch?v=DmSephyJNtQ&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +35. [python 3 how to remove white spaces](https://www.youtube.com/watch?v=0k0fvqikaoE&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +36. [Pandas Tutorial : How to split dataframe by string or date](https://www.youtube.com/watch?v=7sgDvC4k6Xg&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +37. [improve your programming skills with fun](https://www.youtube.com/watch?v=uoAV7651Op0&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +38. [pandas dataframe search for string in all columns filter regex](https://www.youtube.com/watch?v=vbHFIALhSWE&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +39. 
[Pandas is column part of another column in the same row of dataframe](https://www.youtube.com/watch?v=duOHHDqI40c&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +40. [Easily extract tables from websites with pandas and python](https://www.youtube.com/watch?v=OXA_ZD1gR6A&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +41. [Easily extract information from excel with Python and Pandas](https://www.youtube.com/watch?v=hJMH_1o8eU0&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +42. [Python asterisk argument or What is the usage of * asterisk in Python](https://www.youtube.com/watch?v=JBm8iptLnuA&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +43. [Easy Image validation with Python - valid image, blank or pattern](https://www.youtube.com/watch?v=HMB4zrP_-HY&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +44. [Pandas DataFrame generate n-level hierarchical JSON](https://www.youtube.com/watch?v=lCcE-0bykRU&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +45. [Python group or sort list of lists by common element](https://www.youtube.com/watch?v=zVQJQxpedm8&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +46. [Think Python: Chapter 3 Functions 3.2](https://www.youtube.com/watch?v=Ol3Dwucax9U&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +47. [Questions and Answers 1 Improve OCR and tabula range](https://www.youtube.com/watch?v=nrF_Rgh88no&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) +48. 
[Map the headers to a column with pandas?](https://www.youtube.com/watch?v=3g6KG_8zq0E&list=PLeicpQTG639HMut5w0WfLz684cSCMBD4C) diff --git a/notebooks/Books/Think Python/Chapter_3_Functions_1.ipynb b/notebooks/Books/Think Python/Chapter_3_Functions_1.ipynb new file mode 100644 index 0000000..dca266f --- /dev/null +++ b/notebooks/Books/Think Python/Chapter_3_Functions_1.ipynb @@ -0,0 +1,874 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Think Python: How to Think Like a Computer Scientist" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Chapter 3 Functions\n", + "\n", + "* Function calls\n", + "* Math functions\n", + "* Composition\n", + "* Adding new functions\n", + "* Definitions and uses\n", + "* Flow of execution\n", + "* Parameters and arguments\n", + "------\n", + "* Variables and parameters are local\n", + "* Stack diagrams\n", + "* Fruitful functions and void functions\n", + "* Why functions?\n", + "* Debugging\n", + "* Glossary\n", + "* Exercises\n", + "\n", + "\n", + "> In the context of programming, a function is a named sequence of statements that performs a computation. When you define a function, you specify the name and the sequence of statements. Later, you can “call” the function by name." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Functions best practices\n", + "\n", + "* the name describes what the function does\n", + "* it does one thing and only one thing\n", + "* it has documentation\n", + "* it is relatively short" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.1 Function calls" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# type is the function name\n", + "# 'a' is the argument\n", + "\n", + "type('a')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> a function “takes” an argument and “returns” a result" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "32" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "int('32')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "invalid literal for int() with base 10: 'Hello'", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Hello'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mValueError\u001b[0m: invalid literal for int() with base 10: 'Hello'" + ], + "output_type": "error" + } + ], + "source": [ + "int('Hello')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + 
] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "int(3.99999)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "-2" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "int(-2.3)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "32.0" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "float(32)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3.14159" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "float('3.14159')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'3.14159'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "str(3.14159)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'32'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "str(32)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.2 Math functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Python has a math module that provides most of the familiar mathematical functions. A module is a file that contains a collection of related functions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import math\n", + "math" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> This format is called dot notation." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.2184874961635637" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# This example uses math.log10 to compute a signal-to-noise ratio in decibels \n", + "\n", + "signal_power = 5\n", + "noise_power = 3\n", + "ratio = signal_power / noise_power\n", + "decibels = 10 * math.log10(ratio)\n", + "decibels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#The second example finds the sine of radians. The name of the variable is a \n", + "# hint that sin and the other trigonometric functions (cos, tan, etc.) take arguments in radians. \n", + "# To convert from degrees to radians, divide by 180 and multiply by π:\n", + "\n", + "radians = 0.7\n", + "height = math.sin(radians)\n", + "height" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The expression math.pi gets the variable pi from the math module. 
Its value is a \n", + "# floating-point approximation of π, accurate to about 15 digits.\n", + "\n", + "degrees = 45\n", + "radians = degrees / 180.0 * math.pi\n", + "math.sin(radians)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# verify the previous result by\n", + "\n", + "math.sqrt(2) / 2.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> add meaningful and descriptive comments to your functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.3 Composition" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> One of the most useful features of programming languages is their ability to take small building blocks and compose them." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'degrees' is not defined", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdegrees\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0;36m360.0\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0;36m2\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mmath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpi\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mNameError\u001b[0m: name 'degrees' is not defined" + ], + "output_type": "error" + } + ], + "source": [ + "x = math.sin(degrees / 360.0 * 2 * math.pi)\n", + "x" + ] + }, + { + "cell_type": "code", + 
"execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.01745240643728351" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = math.sin(1 / 360.0 * 2 * math.pi)\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = math.exp(math.log(x+1))\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "600" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hours = 10\n", + "minutes = hours * 60\n", + "minutes" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "can't assign to operator (, line 1)", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m hours * 60 = minutes\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m can't assign to operator\n" + ], + "output_type": "error" + } + ], + "source": [ + "hours * 60 = minutes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> avoid confusing and misleading compositions\n", + "\n", + "> keep to the KISS principle - keep it simple, stupid" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.4 Adding new functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> A function definition specifies the name of a new function and the sequence of statements that run when the function is called." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# def - is a keyword that indicates that this is a function definition\n", + "# print_lyrics - the function name\n", + "# () - indicate that this function doesn’t take any arguments.\n", + "\n", + "def print_lyrics():\n", + " print(\"I'm a lumberjack, and I'm okay.\")\n", + " print(\"I sleep all night and I work all day.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> The first line of the function definition is called the header; the rest is called the body. \n", + "\n", + "> Single quotes and double quotes do the same thing in most situations;" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "function" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(print_lyrics)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "print(print_lyrics)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> The syntax for calling the new function is the same as for built-in functions:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm a lumberjack, and I'm okay.\n", + "I sleep all night and I work all day.\n" + ] + } + ], + "source": [ + "print_lyrics()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "def repeat_lyrics():\n", + " print_lyrics()\n", + " print_lyrics()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm a lumberjack, and I'm 
okay.\n", + "I sleep all night and I work all day.\n", + "I'm a lumberjack, and I'm okay.\n", + "I sleep all night and I work all day.\n" + ] + } + ], + "source": [ + "repeat_lyrics()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.5 Definitions and uses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> This program contains two function definitions: print_lyrics and repeat_lyrics. Function definitions get executed just like other statements, but the effect is to create function objects.\n", + "\n", + "> You have to create a function before you can run it. In other words, the function definition has to run before the function gets called." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm a lumberjack, and I'm okay.\n", + "I sleep all night and I work all day.\n", + "I'm a lumberjack, and I'm okay.\n", + "I sleep all night and I work all day.\n" + ] + } + ], + "source": [ + "def print_lyrics():\n", + " print(\"I'm a lumberjack, and I'm okay.\")\n", + " print(\"I sleep all night and I work all day.\")\n", + "\n", + "def repeat_lyrics():\n", + " print_lyrics()\n", + " print_lyrics()\n", + "\n", + "repeat_lyrics()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'repeat_lyrics_new' is not defined", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mrepeat_lyrics_new\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m 
\u001b[0;32mdef\u001b[0m \u001b[0mrepeat_lyrics_new\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mprint_lyrics\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mprint_lyrics\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mNameError\u001b[0m: name 'repeat_lyrics_new' is not defined" + ], + "output_type": "error" + } + ], + "source": [ + "repeat_lyrics_new()\n", + "\n", + "def repeat_lyrics_new():\n", + " print_lyrics()\n", + " print_lyrics()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.6 Flow of execution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> To ensure that a function is defined before its first use, you have to know the order statements run in, which is called the flow of execution.\n", + "\n", + "> Execution always begins at the first statement of the program. Statements are run one at a time, in order from top to bottom.\n", + "\n", + "> In summary, when you read a program, you don’t always want to read from top to bottom. 
Sometimes it makes more sense if you follow the flow of execution.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n", + "3\n", + "2\n" + ] + } + ], + "source": [ + "def print_lyrics_1():\n", + " print(\"1\")\n", + "\n", + "def print_lyrics_2():\n", + " print(\"2\") \n", + " \n", + "def print_lyrics_3():\n", + " print(\"3\")\n", + "\n", + "def repeat_lyrics():\n", + " print_lyrics_1()\n", + " print_lyrics_3()\n", + " print_lyrics_2()\n", + "\n", + "repeat_lyrics()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.7 Parameters and arguments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Some of the functions we have seen require arguments. For example, when you call math.sin you pass a number as an argument. Some functions take more than one argument: math.pow takes two, the base and the exponent.\n", + "\n", + "> Inside the function, the arguments are assigned to variables called parameters. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "def print_twice(bruce):\n", + " print(bruce)\n", + " print(bruce)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Spam\n", + "Spam\n", + "42\n", + "42\n", + "3.141592653589793\n", + "3.141592653589793\n" + ] + } + ], + "source": [ + "print_twice('Spam')\n", + "print_twice(42)\n", + "print_twice(math.pi)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Spam Spam Spam Spam \n", + "Spam Spam Spam Spam \n" + ] + } + ], + "source": [ + "print_twice('Spam '*4)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-1.0\n", + "-1.0\n" + ] + } + ], + "source": [ + "print_twice(math.cos(math.pi))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> The argument is evaluated before the function is called, so in the examples the expressions 'Spam '*4 and math.cos(math.pi) are only evaluated once\n", + "\n", + "> The name of the variable we pass as an argument (michael) has nothing to do with the name of the parameter (bruce)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Eric, the half a bee.\n", + "Eric, the half a bee.\n" + ] + } + ], + "source": [ + "michael = 'Eric, the half a bee.'\n", + "print_twice(michael)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Chapter_4__Case_study_interface_design.ipynb b/notebooks/Books/Think Python/Chapter_4__Case_study_interface_design.ipynb new file mode 100644 index 0000000..217df97 --- /dev/null +++ b/notebooks/Books/Think Python/Chapter_4__Case_study_interface_design.ipynb @@ -0,0 +1,618 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Chapter 4 Case study: interface design\n", + "\n", + "> This chapter presents a case study that demonstrates a process for designing functions that work together.\n", + "\n", + "\n", + "\n", + "* The turtle module\n", + "* Simple repetition\n", + "* Exercises\n", + "* **Encapsulation**\n", + "* **Generalization**\n", + "* **Interface design**\n", + "* **Refactoring**\n", + "* **A development plan**\n", + "* **docstring**\n", + "* Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.1 The turtle module" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'0.23.4'" + ] + }, + "execution_count": 1, + "metadata": {}, + 
"output_type": "execute_result" + } + ], + "source": [ + "import pandas\n", + "pandas.__version__" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import turtle" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> The turtle module (with a lowercase ’t’) provides a function called Turtle (with an uppercase ’T’) that creates a Turtle object, which we assign to a variable named bob. Printing bob displays something like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "# mypolygon.py\n", + "import turtle\n", + "bob = turtle.Turtle()\n", + "print(bob)\n", + "turtle.mainloop()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# draw a right angle\n", + "import turtle\n", + "bob = turtle.Turtle()\n", + "bob.fd(100)\n", + "bob.lt(90)\n", + "bob.fd(100)\n", + "turtle.mainloop()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# A method is similar to a function, but it uses slightly different syntax. \n", + "import turtle\n", + "bob = turtle.Turtle()\n", + "bob.fd(100)\n", + "bob.lt(90)\n", + "bob.fd(100)\n", + "turtle.mainloop()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* A **function** is a piece of code that is called by name. It can be passed data to operate on (i.e. the parameters) and can optionally return data (the return value). All data that is passed to a function is explicitly passed.

\n", + "\n", + "* A **method** is a piece of code that is called by a name **that is associated with an object**. In most respects it is identical to a function except for two key differences:\n", + " * A method is implicitly passed the object on which it was called.\n", + " * A method is able to operate on data that is contained within the class" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.2 Simple repetition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# square\n", + "import turtle\n", + "bob3 = turtle.Turtle()\n", + "\n", + "bob3.fd(100)\n", + "bob3.lt(90)\n", + "\n", + "bob3.fd(100)\n", + "bob3.lt(90)\n", + "\n", + "bob3.fd(100)\n", + "bob3.lt(90)\n", + "\n", + "bob3.fd(100)\n", + "\n", + "turtle.mainloop()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> A for statement is also called a loop because the flow of execution runs through the body and then loops back to the top" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hello!\n", + "Hello!\n", + "Hello!\n", + "Hello!\n" + ] + } + ], + "source": [ + "for i in range(4):\n", + " print('Hello!')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# square \n", + "import turtle\n", + "bob = turtle.Turtle()\n", + "for i in range(4):\n", + " bob.fd(100)\n", + " bob.lt(90)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Did you notice the difference between both programs?\n", + "\n", + "**The art of cognitive blindspots | Kyle Eschen**\n", + "\n", + "https://www.youtube.com/watch?reload=9&v=OOG65rSM5fA" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "## 4.3 Exercises\n", + "\n", + "1. Write a function called square that takes a parameter named t, which is a turtle. It should use the turtle to draw a square.\n", + "Write a function call that passes bob as an argument to square, and then run the program again.

\n", + "\n", + "2. Add another parameter, named length, to square. Modify the body so length of the sides is length, and then modify the function call to provide a second argument. Run the program again. Test your program with a range of values for length.

\n", + "\n", + "3. Make a copy of square and change the name to polygon. Add another parameter named n and modify the body so it draws an n-sided regular polygon. Hint: The exterior angles of an n-sided regular polygon are 360/n degrees.

\n", + "\n", + "4. Write a function called circle that takes a turtle, t, and radius, r, as parameters and that draws an approximate circle by calling polygon with an appropriate length and number of sides. Test your function with a range of values of r.\n", + "Hint: figure out the circumference of the circle and make sure that length * n = circumference.

\n", + "\n", + "5. Make a more general version of circle called arc that takes an additional parameter angle, which determines what fraction of a circle to draw. angle is in units of degrees, so when angle=360, arc should draw a complete circle." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.4 Encapsulation\n", + "\n", + "> Wrapping a piece of code up in a function is called encapsulation. \n", + "\n", + "The major advantages: \n", + "* code re-use\n", + "* shorter programs (it is more concise to call a function twice than to copy and paste the body)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# square \n", + "import turtle\n", + "\n", + "def square(t):\n", + " for i in range(4):\n", + " t.fd(100)\n", + " t.lt(90)\n", + "\n", + "bob = turtle.Turtle()\n", + "square(bob)\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> The innermost statements, fd and lt are indented twice to show that they are inside the for loop, which is inside the function definition. The next line, square(bob), is flush with the left margin, which indicates the end of both the for loop and the function definition." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Inside the function, t refers to the same turtle bob, so t.lt(90) has the same effect as bob.lt(90). In that case, why not call the parameter bob? " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.5 Generalization\n", + "\n", + "> Adding a parameter to a function is called generalization because it makes the function more general: in the previous version, the square is always the same size; in this version it can be any size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add a length parameter to square. 
\n", + "import turtle\n", + "\n", + "def square(t, length):\n", + " for i in range(4):\n", + " t.fd(length)\n", + " t.lt(90)\n", + "\n", + "\n", + "\n", + "bob = turtle.Turtle()\n", + "square(bob, 100)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Instead of drawing squares, polygon draws regular polygons with any number of sides.\n", + "import turtle\n", + "\n", + "def polygon(t, n, length):\n", + " angle = 360 / n\n", + " for i in range(n):\n", + " t.fd(length)\n", + " t.lt(angle)\n", + "\n", + "bob = turtle.Turtle()\n", + "polygon(bob, 21, 70)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> When a function has more than a few numeric arguments, it is easy to forget what they are, or what order they should be in. In that case it is often a good idea to include the names of the parameters in the argument list:\n", + "\n", + "```python\n", + "polygon(bob, n=7, length=70)```\n", + "\n", + "> These are called keyword arguments because they include the parameter names as “keywords” (not to be confused with Python keywords like while and def)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.6 Interface design\n", + "\n", + "> The interface of a function is a summary of how it is used: \n", + "\n", + "* what are the parameters? \n", + "* What does the function do? \n", + "* And what is the return value? \n", + "\n", + "> An interface is “clean” if it allows the caller to do what they want without dealing with unnecessary details.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The next step is to write circle, which takes a radius, r, as a parameter. 
\n", + "import turtle\n", + "import math\n", + "\n", + "def polygon(t, n, length):\n", + " angle = 360 / n\n", + " for i in range(n):\n", + " t.fd(length)\n", + " t.lt(angle)\n", + "\n", + "def circle(t, r):\n", + " circumference = 2 * math.pi * r\n", + " n = 50\n", + " length = circumference / n\n", + " polygon(t, n, length)\n", + "\n", + "bob = turtle.Turtle()\n", + "circle(bob, 75)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# One limitation of this solution is that n is a constant,\n", + "import turtle\n", + "import math\n", + "\n", + "def polygon(t, n, length):\n", + " angle = 360 / n\n", + " for i in range(n):\n", + " t.fd(length)\n", + " t.lt(angle)\n", + "\n", + "def circle(t, r):\n", + " circumference = 2 * math.pi * r\n", + " n = int(circumference / 3) + 3\n", + " length = circumference / n\n", + " polygon(t, n, length)\n", + "\n", + "bob = turtle.Turtle()\n", + "circle(bob, 75)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.7 Refactoring\n", + "\n", + "> This process—rearranging a program to improve interfaces and facilitate code re-use—is called refactoring. In this case, we noticed that there was similar code in arc and polygon, so we “factored it out” into polyline." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# copy of polygon and transform it into arc\n", + "import turtle\n", + "import math\n", + "\n", + "def arc(t, r, angle):\n", + " arc_length = 2 * math.pi * r * angle / 360\n", + " n = int(arc_length / 3) + 1\n", + " step_length = arc_length / n\n", + " step_angle = angle / n\n", + " \n", + " for i in range(n):\n", + " t.fd(step_length)\n", + " t.lt(step_angle)\n", + "\n", + "bob = turtle.Turtle()\n", + "arc(bob, 100, 180)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# general function polyline\n", + "# rewrite polygon and arc to use polyline\n", + "\n", + "import turtle\n", + "import math\n", + "\n", + "def polyline(t, n, length, angle):\n", + " for i in range(n):\n", + " t.fd(length)\n", + " t.lt(angle)\n", + "\n", + "def polygon(t, n, length):\n", + " angle = 360.0 / n\n", + " polyline(t, n, length, angle)\n", + "\n", + "def arc(t, r, angle):\n", + " arc_length = 2 * math.pi * r * angle / 360\n", + " n = int(arc_length / 3) + 1\n", + " step_length = arc_length / n\n", + " step_angle = float(angle) / n\n", + " polyline(t, n, step_length, step_angle)\n", + " \n", + "def circle(t, r):\n", + " arc(t, r, 360)\n", + "\n", + "bob = turtle.Turtle()\n", + "arc(bob, 100, 180)\n", + "\n", + "turtle.done()\n", + "\n", + "import os\n", + "os._exit(00)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.8 A development plan\n", + "\n", + "1. Start by writing a small program with no function definitions.

\n", + "2. Once you get the program working, identify a coherent piece of it, encapsulate the piece in a function and give it a name.

\n", + "3. Generalize the function by adding appropriate parameters.

\n", + "4. Repeat steps 1–3 until you have a set of working functions. Copy and paste working code to avoid retyping (and re-debugging).

\n", + "5. Look for opportunities to improve the program by refactoring. For example, if you have similar code in several places, consider factoring it into an appropriately general function.

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.9 docstring\n", + "\n", + "> A docstring is a string at the beginning of a function that explains the interface (“doc” is short for “documentation”)." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "polyline\n", + "square\n" + ] + } + ], + "source": [ + "import turtle\n", + "\n", + "def polyline():\n", + " \"\"\"Draws n line segments with the given length and\n", + " angle (in degrees) between them. t is a turtle.\n", + " \"\"\" \n", + " print('polyline')\n", + " #for i in range(n):\n", + " # t.fd(length)\n", + " # t.lt(angle)\n", + " \n", + "def square():\n", + " print('square')\n", + " \n", + "polyline() \n", + "square()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4.10 Debugging\n", + "\n", + "> If the preconditions are satisfied and the postconditions are not, the bug is in the function. If your pre- and postconditions are clear, they can help with debugging." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Chapter_5__Conditionals_and_recursion.ipynb b/notebooks/Books/Think Python/Chapter_5__Conditionals_and_recursion.ipynb new file mode 100644 index 0000000..73f5439 --- /dev/null +++ b/notebooks/Books/Think Python/Chapter_5__Conditionals_and_recursion.ipynb @@ -0,0 +1,1682 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 5 Conditionals and recursion" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* Modulus operator\n", + "* Boolean expressions\n", + "* Logical operators\n", + "* Conditional and Alternative execution\n", + "* Chained and Nested conditionals\n", + "* Recursion and Infinite recursion\n", + "* Keyboard input\n", + "* Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.1 Floor division and modulus" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The main topic of this chapter is the if statement, which\n", + "executes different code depending on the state of the program.\n", + "But first I want to introduce two new operators: floor division\n", + "and modulus." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The floor division operator, //, divides\n", + "two numbers and rounds down to an integer. For example, suppose the\n", + "run time of a movie is 105 minutes. You might want to know how\n", + "long that is in hours. 
Conventional division\n", + "returns a floating-point number:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1.75" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "minutes = 105\n", + "minutes / 60" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But we don’t normally write hours with decimal points. Floor\n", + "division returns the integer number of hours, rounding down:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "minutes = 105\n", + "hours = minutes // 60\n", + "hours" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get the remainder, you could subtract off one hour in minutes:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "45" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "remainder = minutes - hours * 60\n", + "remainder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An alternative is to use the modulus operator, %, which\n", + "divides two numbers and returns the remainder." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "45" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "remainder = minutes % 60\n", + "remainder" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The modulus operator is more useful than it seems. 
For\n", + "example, you can check whether one number is divisible by another—if\n", + "x % y is zero, then x is divisible by y.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also, you can extract the right-most digit\n", + "or digits from a number. For example, x % 10 yields the\n", + "right-most digit of x (in base 10). Similarly x % 100\n", + "yields the last two digits." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are using Python 2, division works differently. The\n", + "division operator, /, performs floor division if both\n", + "operands are integers, and floating-point division if either\n", + "operand is a float.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.2 Boolean expressions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A boolean expression is an expression that is either true\n", + "or false. The following examples use the \n", + "operator ==, which compares two operands and produces\n", + "True if they are equal and False otherwise:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "5 == 5" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "5 == 6" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "True and False are special\n", + "values that belong to the type bool; they are not strings:\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bool" + ] + }, + "execution_count": 7, 
+ "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(True)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bool" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(False)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type('True') ## Question?" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type('true')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The == operator is one of the relational operators; the\n", + "others are:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x != y # x is not equal to y\n", + "x > y # x is greater than y\n", + "x < y # x is less than y\n", + "x >= y # x is greater than or equal to y\n", + "x <= y # x is less than or equal to y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Although these operations are probably familiar to you, the Python\n", + "symbols are different from the mathematical symbols. A common error\n", + "is to use a single equal sign (=) instead of a double equal sign\n", + "(==). Remember that = is an assignment operator and\n", + "== is a relational operator. 
There is no such thing as\n", + "=< or =>.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.3 Logical operators" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are three logical operators: and, or, and not. The semantics (meaning) of these operators is\n", + "similar to their meaning in English. For example,\n", + "x > 0 and x < 10 is true only if x is greater than 0\n", + "and less than 10.\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = 5\n", + "x > 0 and x < 10" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = 15\n", + "x > 0 and x < 10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "n%2 == 0 or n%3 == 0 is true if either or both of the\n", + "conditions is true, that is, if the number is divisible by 2 or\n", + "3." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "False True False\n", + "False False True\n", + "True True True\n", + "False False False\n" + ] + } + ], + "source": [ + "for n in [4,9,6, 7]:\n", + " print(n%2 == 0 and n%3 == 0, n%2 == 0, n%3 == 0 )" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True True False\n", + "True False True\n", + "True True True\n", + "False False False\n" + ] + } + ], + "source": [ + "for n in [4,9,6,7]:\n", + " print(n%2 == 0 or n%3 == 0, n%2 == 0, n%3 == 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, the not operator negates a boolean\n", + "expression, so not (x > y) is true if x > y is false,\n", + "that is, if x is less than or equal to y." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "not True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Strictly speaking, the operands of the logical operators should be\n", + "boolean expressions, but Python is not very strict.\n", + "Any nonzero number is interpreted as True:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "42 and True" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "0 and True" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This flexibility can be useful, but there are some subtleties to\n", + "it that might be confusing. You might want to avoid it (unless\n", + "you know what you are doing)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Bonus: Boolean algebra and Truth table\n", + "\n", + "* https://en.wikipedia.org/wiki/Boolean_algebra\n", + "* https://en.wikipedia.org/wiki/Truth_table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.4 Conditional execution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "In order to write useful programs, we almost always need the ability\n", + "to check conditions and change the behavior of the program\n", + "accordingly. Conditional statements give us this ability. The\n", + "simplest form is the if statement:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "x is positive\n" + ] + } + ], + "source": [ + "x = 42\n", + "if x > 0:\n", + " print('x is positive')" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1 is positive\n", + "4 is positive\n" + ] + } + ], + "source": [ + "for x in [1, -2, 4]: ## Question?\n", + " if x > 0:\n", + " print(f'{x} is positive')" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "ename": "TypeError", + "evalue": "'>' not supported between instances of 'str' and 'int'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;34m'5'\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m: '>' not supported between instances of 'str' and 'int'" + ] + } + ], + "source": [ + "'5' > 0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The boolean expression after if is\n", + "called the condition. If it is true, the indented\n", + "statement runs. If not, nothing happens.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "if statements have the same structure as function definitions:\n", + "a header followed by an indented body. Statements like this are\n", + "called compound statements." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is no limit on the number of statements that can appear in\n", + "the body, but there has to be at least one.\n", + "Occasionally, it is useful to have a body with no statements (usually\n", + "as a place keeper for code you haven’t written yet). In that\n", + "case, you can use the pass statement, which does nothing.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = -42\n", + "if x < 0:\n", + " pass # TODO: need to handle negative values!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.5 Alternative execution" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A second form of the if statement is “alternative execution”,\n", + "in which there are two possibilities and the condition determines\n", + "which one runs. 
The syntax looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "x is even\n" + ] + } + ], + "source": [ + "if x % 2 == 0:\n", + " print('x is even')\n", + "else:\n", + " print('x is odd')" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1 is odd\n", + "-2 is even\n", + "4 is even\n" + ] + } + ], + "source": [ + "# f-strings or string interpolation\n", + "\n", + "for x in [1, -2, 4]:\n", + " if x % 2 == 0:\n", + " print(f'{x} is even')\n", + " else:\n", + " print(f'{x} is odd')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the remainder when x is divided by 2 is 0, then we know that\n", + "x is even, and the program displays an appropriate message. If\n", + "the condition is false, the second set of statements runs.\n", + "Since the condition must be true or false, exactly one of the\n", + "alternatives will run. The alternatives are called branches, because they are branches in the flow of execution.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.6 Chained conditionals" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes there are more than two possibilities and we need more than\n", + "two branches. 
One way to express a computation like that is a chained conditional:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "x is less than y\n" + ] + } + ], + "source": [ + "y = 42\n", + "if x < y and 1:\n", + " print('x is less than y')\n", + "elif x > y:\n", + " print('x is greater than y')\n", + "else:\n", + " print('x and y are equal')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "elif is an abbreviation of “else if”. Again, exactly one\n", + "branch will run. There is no limit on the number of elif statements. If there is an else clause, it has to be\n", + "at the end, but there doesn’t have to be one.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if choice == 'a':\n", + " draw_a()\n", + "elif choice == 'b':\n", + " draw_b()\n", + "elif choice == 'c':\n", + " draw_c()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Each condition is checked in order. If the first is false,\n", + "the next is checked, and so on. If one of them is\n", + "true, the corresponding branch runs and the statement\n", + "ends. Even if more than one condition is true, only the\n", + "first true branch runs. " + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100\n" + ] + } + ], + "source": [ + "if x < 100: ## Question: What will be the output?\n", + " print('100')\n", + "elif x < 101:\n", + " print('101')\n", + "elif x < 102:\n", + " print('102')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.7 Nested conditionals" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One conditional can also be nested within another. 
We could have\n", + "written the example in the previous section like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if x == y:\n", + " print('x and y are equal')\n", + "else:\n", + " if x < y:\n", + " print('x is less than y')\n", + " else:\n", + " print('x is greater than y')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The outer conditional contains two branches. The\n", + "first branch contains a simple statement. The second branch\n", + "contains another if statement, which has two branches of its\n", + "own. Those two branches are both simple statements,\n", + "although they could have been conditional statements as well." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Although the indentation of the statements makes the structure\n", + "apparent, nested conditionals become difficult to read very\n", + "quickly. It is a good idea to avoid them when you can." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Logical operators often provide a way to simplify nested conditional\n", + "statements. 
For example, we can rewrite the following code using a\n", + "single conditional:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if 0 < x:\n", + " if x < 10:\n", + " print('x is a positive single-digit number.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The print statement runs only if we make it past both\n", + "conditionals, so we can get the same effect with the and operator:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if 0 < x and x < 10:\n", + " print('x is a positive single-digit number.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this kind of condition, Python provides a more concise option:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if 0 < x < 10:\n", + " print('x is a positive single-digit number.') ## Question?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.8 Recursion" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is legal for one function to call another;\n", + "it is also legal for a function to call itself. It may not be obvious\n", + "why that is a good thing, but it turns out to be one of the most\n", + "magical things a program can do.\n", + "For example, look at the following function:" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "def countdown(n):\n", + " if n <= 0:\n", + " print('Blastoff!')\n", + " else:\n", + " print(n)\n", + " countdown(n-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If n is 0 or negative, it outputs the word, “Blastoff!”\n", + "Otherwise, it outputs n and then calls a function named countdown—itself—passing n-1 as an argument." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What happens if we call this function like this?" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3\n", + "2\n", + "1\n", + "Blastoff!\n" + ] + } + ], + "source": [ + "countdown(3) ## Question?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The execution of countdown begins with n=3, and since\n", + "n is greater than 0, it outputs the value 3, and then calls itself..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The countdown that got n=3 returns." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And then you’re back in __main__. So, the\n", + "total output looks like this:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A function that calls itself is recursive; the process of\n", + "executing it is called recursion.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As another example, we can write a function that prints a\n", + "string n times." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "def print_n(s, n):\n", + " if n <= 0:\n", + " return\n", + " print(s)\n", + " print_n(s, n-1)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "s\n", + "s\n" + ] + } + ], + "source": [ + "print_n('s', 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If n <= 0 the return statement exits the function. 
The\n", + "flow of execution immediately returns to the caller, and the remaining\n", + "lines of the function don’t run.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The rest of the function is similar to countdown: it displays\n", + "s and then calls itself to display s n−1 additional\n", + "times. So the number of lines of output is 1 + (n - 1), which\n", + "adds up to n." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For simple examples like this, it is probably easier to use a for loop. But we will see examples later that are hard to write\n", + "with a for loop and easy to write with recursion, so it is\n", + "good to start early.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.9 Stack diagrams for recursive functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Section 3.9, we used a stack diagram to represent\n", + "the state of a program during a function call. The same kind of\n", + "diagram can help interpret a recursive function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every time a function gets called, Python creates a\n", + "frame to contain the function’s local variables and parameters.\n", + "For a recursive function, there might be more than one frame on the\n", + "stack at the same time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Figure 5.1 shows a stack diagram for countdown called with\n", + "n = 3." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As usual, the top of the stack is the frame for __main__.\n", + "It is empty because we did not create any variables in \n", + "__main__ or pass any arguments to it.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The four countdown frames have different values for the\n", + "parameter n. 
The bottom of the stack, where n=0, is\n", + "called the base case. It does not make a recursive call, so\n", + "there are no more frames." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an exercise, draw a stack diagram for print_n called with\n", + "s = 'Hello' and n=2.\n", + "Then write a function called do_n that takes a function\n", + "object and a number, n, as arguments, and that calls\n", + "the given function n times." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.10 Infinite recursion" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If a recursion never reaches a base case, it goes on making\n", + "recursive calls forever, and the program never terminates. This is\n", + "known as infinite recursion, and it is generally not\n", + "a good idea. Here is a minimal program with an infinite recursion:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "def recurse():\n", + " recurse()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In most programming environments, a program with infinite recursion\n", + "does not really run forever. 
Python reports an error\n", + "message when the maximum recursion depth is reached:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "ename": "RecursionError", + "evalue": "maximum recursion depth exceeded", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRecursionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mrecurse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m## Question?\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mrecurse\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrecurse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrecurse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "... last 1 frames repeated, from the frame below ...\n", + "\u001b[0;32m\u001b[0m in \u001b[0;36mrecurse\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mrecurse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mrecurse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mRecursionError\u001b[0m: maximum recursion depth exceeded" + ] + } + ], + "source": [ + "recurse() ## Question?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This traceback is a little bigger than the one we saw in the\n", + "previous chapter. 
When the error occurs, there are 1000\n", + "recurse frames on the stack!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you encounter an infinite recursion by accident, review\n", + "your function to confirm that there is a base case that does not\n", + "make a recursive call. And if there is a base case, check whether\n", + "you are guaranteed to reach it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.11 Keyboard input" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The programs we have written so far accept no input from the user.\n", + "They just do the same thing every time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python provides a built-in function called input that\n", + "stops the program and\n", + "waits for the user to type something. When the user presses Return or Enter, the program resumes and input\n", + "returns what the user typed as a string. In Python 2, the same\n", + "function is called raw_input.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "x\n" + ] + } + ], + "source": [ + "text = input()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'x'" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Before getting input from the user, it is a good idea to print a\n", + "prompt telling the user what to type. 
input can take a\n", + "prompt as an argument:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "What...is your name?\n", + "x\n" + ] + } + ], + "source": [ + "name = input('What...is your name?\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'x'" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "name" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The sequence \\n at the end of the prompt represents a newline, which is a special character that causes a line break.\n", + "That’s why the user’s input appears below the prompt. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you expect the user to type an integer, you can try to convert\n", + "the return value to int:" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "What...is the airspeed velocity of an unladen swallow?\n", + "100\n" + ] + } + ], + "source": [ + "prompt = 'What...is the airspeed velocity of an unladen swallow?\\n'\n", + "speed = input(prompt)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'100'" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "speed" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But if the user types something other than a string of digits,\n", + "you get an error:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "speed = input(prompt)\n", + "What...is the airspeed velocity of an unladen swallow?\n", + 
"What do you mean, an African or a European swallow?\n", + "int(speed)\n", + "ValueError: invalid literal for int() with base 10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We will see how to handle this kind of error later.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "str" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(speed) ## Question?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.12 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When a syntax or runtime error occurs, the error message contains\n", + "a lot of information, but it can be overwhelming. The most\n", + "useful parts are usually:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Syntax errors are usually easy to find, but there are a few\n", + "gotchas. Whitespace errors can be tricky because spaces and\n", + "tabs are invisible and we are used to ignoring them.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "x = 5 ## Question?\n", + "y = 6" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In this example, the problem is that the second line is indented by\n", + "one space. But the error message points to y, which is\n", + "misleading. In general, error messages indicate where the problem was\n", + "discovered, but the actual error might be earlier in the code,\n", + "sometimes on a previous line.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same is true of runtime errors. 
Suppose you are trying\n", + "to compute a signal-to-noise ratio in decibels. The formula\n", + "is SNRdb = 10 log10 (Psignal / Pnoise). In Python,\n", + "you might write something like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "math domain error", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mnoise_power\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m10\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mratio\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msignal_power\u001b[0m \u001b[0;34m//\u001b[0m \u001b[0mnoise_power\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mdecibels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m10\u001b[0m \u001b[0;34m*\u001b[0m \u001b[0mmath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlog10\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mratio\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdecibels\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mValueError\u001b[0m: math domain error" + ] + } + ], + "source": [ + "import math\n", + "signal_power = 9\n", + "noise_power = 10\n", + "ratio = signal_power // noise_power\n", + "decibels = 10 * math.log10(ratio)\n", + "print(decibels)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "When you run this program, you get an exception:\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The error message indicates 
line 5, but there is nothing\n", + "wrong with that line. To find the real error, it might be\n", + "useful to print the value of ratio, which turns out to\n", + "be 0. The problem is in line 4, which uses floor division\n", + "instead of floating-point division.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You should take the time to read error messages carefully, but don’t\n", + "assume that everything they say is correct." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5.13 Glossary" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Chapter_6__Fruitful_functions.ipynb b/notebooks/Books/Think Python/Chapter_6__Fruitful_functions.ipynb new file mode 100644 index 0000000..b47322b --- /dev/null +++ b/notebooks/Books/Think Python/Chapter_6__Fruitful_functions.ipynb @@ -0,0 +1,1699 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 6  Fruitful functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "* Return values\n", + "* Incremental development\n", + "* Composition\n", + "* Boolean functions\n", + "* More recursion\n", + "* Leap of faith" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.1 Return values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Many of the Python functions we have used, such as the math\n", + "functions, produce return values. 
But the functions we’ve written\n", + "are all void: they have an effect, like printing a value\n", + "or moving a turtle, but they don’t have a return value. In\n", + "this chapter you will learn to write fruitful functions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def print_str(s):\n", + " print(s)\n", + "print_str(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def print(s):\n", + " print(s)\n", + "print(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "del print" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def double_int(i):\n", + " return i * 2\n", + "double_int(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = print_str(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y = double_int(1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(x, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Calling the function generates a return\n", + "value, which we usually assign to a variable or use as part of an\n", + "expression." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "radius, radians = 2.0, math.pi / 6 # example values (assumed for illustration)\n", + "e = math.exp(1.0)\n", + "height = radius * math.sin(radians)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "math.exp(1.0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import math \n", + "math.sin(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The functions we have written so far are void. Speaking casually,\n", + "they have no return value; more precisely,\n", + "their return value is None." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def print_me(s):\n", + " print(s)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_me('x')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y = print_me('x')\n", + "y" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this chapter, we are (finally) going to write fruitful functions.\n", + "The first example is area, which returns the area of a circle\n", + "with the given radius:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def area(radius):\n", + " a = math.pi * radius**2\n", + " return a" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We have seen the return statement before, but in a fruitful\n", + "function the return statement includes\n", + "an expression. 
This statement means: “Return immediately from\n", + "this function and use the following expression as a return value.”\n", + "The expression can be arbitrarily complicated, so we could\n", + "have written this function more concisely:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def area(radius):\n", + " return math.pi * radius**2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "On the other hand, temporary variables like a can make\n", + "debugging easier.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes it is useful to have multiple return statements, one in each\n", + "branch of a conditional:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def absolute_value(x):\n", + " if x < 0:\n", + " return -x\n", + " else:\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Since these return statements are in an alternative conditional,\n", + "only one runs." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As soon as a return statement runs, the function\n", + "terminates without executing any subsequent statements.\n", + "Code that appears after a return statement, or any other place\n", + "the flow of execution can never reach, is called dead code.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def area_x(radius):\n", + " return 0\n", + " print('x')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "area_x(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In a fruitful function, it is a good idea to ensure\n", + "that every possible path through the program hits a\n", + "return statement. 
For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def absolute_value(x):\n", + " if x < 0:\n", + " return -x\n", + " if x > 0:\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This function is incorrect because if x happens to be 0,\n", + "neither condition is true, and the function ends without hitting a\n", + "return statement. If the flow of execution gets to the end\n", + "of a function, the return value is None, which is not\n", + "the absolute value of 0.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(absolute_value(0)) # prints None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "By the way, Python provides a built-in function called \n", + "abs that computes absolute values.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an **exercise**, write a compare function that\n", + "takes two values, x and y, and returns 1 if x > y,\n", + "0 if x == y, and -1 if x < y.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Bonus** You can return more than one value from a function by returning a tuple (or a list), which the caller can then unpack." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def area_y(radius):\n", + " return 0, 1, 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "area_y(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x,y,z = area_y(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.2 Incremental development" + ] +
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note**: Have a clear \n", + "* use case\n", + "* specifications\n", + "* test results/case:\n", + "\n", + "> these values so that the horizontal distance is 3 and the\n", + "vertical distance is 4; that way, the result is 5, the hypotenuse \n", + "of a 3-4-5 triangle." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you write larger functions, you might find yourself\n", + "spending more time debugging." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To deal with increasingly complex programs,\n", + "you might want to try a process called\n", + "incremental development. The goal of incremental development\n", + "is to avoid long debugging sessions by adding and testing only\n", + "a small amount of code at a time.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an example, suppose you want to find the distance between two\n", + "points, given by the coordinates (x1, y1) and (x2, y2).\n", + "By the Pythagorean theorem, the distance is:\n", + "\n", + "$$\\text{distance} = \\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first step is to consider what a distance function should\n", + "look like in Python. In other words, what are the inputs (parameters)\n", + "and what is the output (return value)?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, the inputs are two points, which you can represent\n", + "using four numbers. The return value is the distance represented by\n", + "a floating-point value."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Immediately you can write an outline of the function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def distance(x1, y1, x2, y2):\n", + " return 0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Obviously, this version doesn’t compute distances; it always returns\n", + "zero. But it is syntactically correct, and it runs, which means that\n", + "you can test it before you make it more complicated." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To test the new function, call it with sample arguments:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "distance(1, 2, 4, 6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "I chose these values so that the horizontal distance is 3 and the\n", + "vertical distance is 4; that way, the result is 5, the hypotenuse \n", + "of a 3-4-5 triangle. When testing a function, it is\n", + "useful to know the right answer.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point we have confirmed that the function is syntactically\n", + "correct, and we can start adding code to the body.\n", + "A reasonable next step is to find the differences\n", + "x2 − x1 and y2 − y1. The next version stores those values in\n", + "temporary variables and prints them." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def distance(x1, y1, x2, y2):\n", + " dx = x2 - x1\n", + " dy = y2 - y1\n", + " print('dx is', dx)\n", + " print('dy is', dy)\n", + " return 0.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "distance(1, 2, 4, 6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the function is working, it should display dx is 3 and \n", + "dy is 4. If so, we know that the function is getting the right\n", + "arguments and performing the first computation correctly. If not,\n", + "there are only a few lines to check." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we compute the sum of squares of dx and dy:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def distance(x1, y1, x2, y2):\n", + " dx = x2 - x1\n", + " dy = y2 - y1\n", + " dsquared = dx**2 + dy**2\n", + " print('dsquared is: ', dsquared)\n", + " return 0.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "distance(1, 2, 4, 6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Again, you would run the program at this stage and check the output\n", + "(which should be 25).\n", + "Finally, you can use math.sqrt to compute and return the result:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def distance(x1, y1, x2, y2):\n", + " dx = x2 - x1\n", + " dy = y2 - y1\n", + " dsquared = dx**2 + dy**2\n", + " result = math.sqrt(dsquared)\n", + " return result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "distance(1, 2, 4, 6)" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "\n", + "If that works correctly, you are done. Otherwise, you might\n", + "want to print the value of result before the return\n", + "statement." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The final version of the function doesn’t display anything when it\n", + "runs; it only returns a value. The print statements we wrote\n", + "are useful for debugging, but once you get the function working, you\n", + "should remove them. Code like that is called scaffolding\n", + "because it is helpful for building the program but is not part of the\n", + "final product.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you start out, you should add only a line or two of code at a\n", + "time. As you gain more experience, you might find yourself writing\n", + "and debugging bigger chunks. Either way, incremental development\n", + "can save you a lot of debugging time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The key aspects of the process are:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an exercise, use incremental development to write a function\n", + "called hypotenuse that returns the length of the hypotenuse of a\n", + "right triangle given the lengths of the other two legs as arguments.\n", + "Record each stage of the development process as you go.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.3 Composition" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you should expect by now, you can call one function from within\n", + "another. As an example, we’ll write a function that takes two points,\n", + "the center of the circle and a point on the perimeter, and computes\n", + "the area of the circle." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Assume that the center point is stored in the variables xc and\n", + "yc, and the perimeter point is in xp and yp. The\n", + "first step is to find the radius of the circle, which is the distance\n", + "between the two points. We just wrote a function, distance, that does that:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "radius = distance(1, 2, 4, 6)\n", + "radius" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The next step is to find the area of a circle with that radius;\n", + "we just wrote that, too:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "result = area(radius)\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Encapsulating these steps in a function, we get:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def circle_area(xc, yc, xp, yp): # 1, 2, 4, 6\n", + " radius = distance(xc, yc, xp, yp) # 5\n", + " result = area(radius) # 78.539\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The temporary variables radius and result are useful for\n", + "development and debugging, but once the program is working, we can\n", + "make it more concise by composing the function calls:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def circle_area(xc, yc, xp, yp):\n", + " return area(distance(xc, yc, xp, yp))\n", + "circle_area(1, 2, 4, 6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.4 Boolean functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Functions can return booleans, which is often convenient for 
hiding\n", + "complicated tests inside functions. \n", + "For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_divisible(x, y):\n", + " if x % y == 0:\n", + " return True\n", + " else:\n", + " return False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "It is common to give boolean functions names that sound like yes/no\n", + "questions; is_divisible returns either True or False\n", + "to indicate whether x is divisible by y." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_divisible(6, 4)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_divisible(6, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_divisible(0, 0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result of the == operator is a boolean, so we can write the\n", + "function more concisely by returning it directly:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_divisible(x, y):\n", + " return x % y == 0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Boolean functions are often used in conditional statements:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if is_divisible(x, y):\n", + " print('x is divisible by y')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "It might be tempting to write something like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if 
is_divisible(x, y) == True:\n", + " print('x is divisible by y')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But the extra comparison is unnecessary." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an exercise, write a function is_between(x, y, z) that\n", + "returns True if x ≤ y ≤ z or False otherwise." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.5 More recursion" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have only covered a small subset of Python, but you might\n", + "be interested to know that this subset is a complete\n", + "programming language, which means that anything that can be\n", + "computed can be expressed in this language. Any program ever written\n", + "could be rewritten using only the language features you have learned\n", + "so far (actually, you would need a few commands to control devices\n", + "like the mouse, disks, etc., but that’s all)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Proving that claim is a nontrivial exercise first accomplished by Alan\n", + "Turing, one of the first computer scientists (some would argue that he\n", + "was a mathematician, but a lot of early computer scientists started as\n", + "mathematicians). Accordingly, it is known as the Turing Thesis.\n", + "For a more complete (and accurate) discussion of the Turing Thesis,\n", + "I recommend Michael Sipser’s book Introduction to the\n", + "Theory of Computation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To give you an idea of what you can do with the tools you have learned\n", + "so far, we’ll evaluate a few recursively defined mathematical\n", + "functions. A recursive definition is similar to a circular\n", + "definition, in the sense that the definition contains a reference to\n", + "the thing being defined. 
A truly circular definition is not very\n", + "useful:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "vorpal: An adjective used to describe something that is vorpal." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you saw that definition in the dictionary, you might be annoyed. On\n", + "the other hand, if you looked up the definition of the factorial\n", + "function, denoted with the symbol !, you might get something like\n", + "this:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "0! = 1\n", + "\n", + "n! = n · (n−1)!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This definition says that the factorial of 0 is 1, and the factorial\n", + "of any other value, n, is n multiplied by the factorial of n−1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So 3! is 3 times 2!, which is 2 times 1!, which is 1 times\n", + "0!. Putting it all together, 3! equals 3 times 2 times 1 times 1,\n", + "which is 6.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you can write a recursive definition of something, you can\n", + "write a Python program to evaluate it. The first step is to decide\n", + "what the parameters should be.
In this case it should be clear\n", + "that factorial takes an integer:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def factorial(n):" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the argument happens to be 0, all we have to do is return 1:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def factorial(n):\n", + " if n == 0:\n", + " return 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Otherwise, and this is the interesting part, we have to make a\n", + "recursive call to find the factorial of n−1 and then multiply it by\n", + "n:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def factorial(n):\n", + " if n == 0:\n", + " return 1\n", + " else:\n", + " recurse = factorial(n-1)\n", + " result = n * recurse\n", + " return result" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "720" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "factorial(6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The flow of execution for this program is similar to the flow of countdown in Section 5.8. If we call factorial\n", + "with the value 3:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since 3 is not 0, we take the second branch and calculate the factorial\n", + "of n-1..." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The return value (2) is multiplied by n, which is 3, and the result, 6,\n", + "becomes the return value of the function call that started the whole\n", + "process.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Figure 6.1 shows what the stack diagram looks like for\n", + "this sequence of function calls." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The return values are shown being passed back up the stack. In each\n", + "frame, the return value is the value of result, which is the\n", + "product of n and recurse.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the last frame, the local\n", + "variables recurse and result do not exist, because\n", + "the branch that creates them does not run." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.6 Leap of faith\n", + "\n", + "#### flow of execution vs Leap of faith" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Following the flow of execution is one way to read programs, but\n", + "it can quickly become overwhelming. An\n", + "alternative is what I call the “leap of faith”. When you come to a\n", + "function call, instead of following the flow of execution, you assume that the function works correctly and returns the right\n", + "result." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In fact, you are already practicing this leap of faith when you use\n", + "built-in functions. When you call math.cos or math.exp,\n", + "you don’t examine the bodies of those functions. You just\n", + "assume that they work because the people who wrote the built-in\n", + "functions were good programmers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def circle_area(xc, yc, xp, yp): # 1, 2, 4, 6\n", + " radius = distance(xc, yc, xp, yp) # 5\n", + " result = area(radius) # 78.539\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same is true when you call one of your own functions. For\n", + "example, in Section 6.4, we wrote a function called \n", + "is_divisible that determines whether one number is divisible by\n", + "another. Once we have convinced ourselves that this function is\n", + "correct—by examining the code and testing—we can use the function\n", + "without looking at the body again.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_divisible(x, y):\n", + " if x % y == 0:\n", + " return True\n", + " else:\n", + " return False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same is true of recursive programs. When you get to the recursive\n", + "call, instead of following the flow of execution, you should assume\n", + "that the recursive call works (returns the correct result) and then ask\n", + "yourself, “Assuming that I can find the factorial of n−1, can I\n", + "compute the factorial of n?” It is clear that you\n", + "can, by multiplying by n." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Of course, it’s a bit strange to assume that the function works\n", + "correctly when you haven’t finished writing it, but that’s why\n", + "it’s called a leap of faith!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.7 One more example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "After factorial, the most common example of a recursively\n", + "defined mathematical function is fibonacci, which has the\n", + "following definition (see\n", + "http://en.wikipedia.org/wiki/Fibonacci_number):\n", + "\n", + "\n", + "```\n", + "fibonacci(0) = 0\n", + "fibonacci(1) = 1\n", + "fibonacci(n) = fibonacci(n−1) + fibonacci(n−2)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Translated into Python, it looks like this:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "def fibonacci(n):\n", + "    if n == 0:\n", + "        return 0\n", + "    elif n == 1:\n", + "        return 1\n", + "    else:\n", + "        return fibonacci(n-1) + fibonacci(n-2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you try to follow the flow of execution here, even for fairly\n", + "small values of n, your head explodes.
But according to the\n", + "leap of faith, if you assume that the two recursive calls\n", + "work correctly, then it is clear that you get\n", + "the right result by adding them together.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0, 1, 1, 2, 3, 5, 8, 13, 21, 34, " + ] + } + ], + "source": [ + "for i in range(0,10):\n", + " print(fibonacci(i), end=', ')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "21" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fibonacci(8.0)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "ename": "RecursionError", + "evalue": "maximum recursion depth exceeded in comparison", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mRecursionError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfibonacci\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m8.5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mfibonacci\u001b[0;34m(n)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfibonacci\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m 
\u001b[0mfibonacci\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "... last 1 frames repeated, from the frame below ...\n", + "\u001b[0;32m\u001b[0m in \u001b[0;36mfibonacci\u001b[0;34m(n)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfibonacci\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0mfibonacci\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mRecursionError\u001b[0m: maximum recursion depth exceeded in comparison" + ] + } + ], + "source": [ + "fibonacci(8.5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.8 Checking types" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What happens if we call factorial and give it 1.5 as an argument?\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "factorial(1.5)\n", + "# RecursionError: maximum recursion depth exceeded" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "It looks like an infinite recursion. How can that be? The function\n", + "has a base case, when n == 0.
But if n is not an integer,\n", + "we can miss the base case and recurse forever.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the first recursive call, the value of n is 0.5.\n", + "In the next, it is -0.5. From there, it gets smaller\n", + "(more negative), but it will never be 0." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have two choices. We can try to generalize the factorial\n", + "function to work with floating-point numbers, or we can make factorial check the type of its argument. The first option is\n", + "called the gamma function and it’s a\n", + "little beyond the scope of this book. So we’ll go for the second.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the built-in function isinstance to verify the type\n", + "of the argument. While we’re at it, we can also make sure the\n", + "argument is not negative:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "def factorial(n):\n", + "    if not isinstance(n, int):\n", + "        print('Factorial is only defined for integers.')\n", + "        return None\n", + "    elif n < 0:\n", + "        print('Factorial is not defined for negative integers.')\n", + "        return None\n", + "    elif n == 0:\n", + "        return 1\n", + "    else:\n", + "        return n * factorial(n-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first base case handles nonintegers; the\n", + "second handles negative integers.
In both cases, the program prints\n", + "an error message and returns None to indicate that something\n", + "went wrong:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Factorial is only defined for integers.\n", + "None\n" + ] + } + ], + "source": [ + "print(factorial('fred'))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Factorial is not defined for negative integers.\n", + "None\n" + ] + } + ], + "source": [ + "print(factorial(-2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If we get past both checks, we know that n is a non-negative integer, so we can prove that the recursion terminates.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This program demonstrates a pattern sometimes called a guardian.\n", + "The first two conditionals act as guardians, protecting the code that\n", + "follows from values that might cause an error. The guardians make it\n", + "possible to prove the correctness of the code." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Section 11.4 we will see a more flexible alternative to printing\n", + "an error message: raising an exception." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.9 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Breaking a large program into smaller functions creates natural\n", + "checkpoints for debugging. 
If a function is not\n", + "working, there are three possibilities to consider:\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To rule out the first possibility, you can add a print statement\n", + "at the beginning of the function and display the values of the\n", + "parameters (and maybe their types). Or you can write code\n", + "that checks the preconditions explicitly.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the parameters look good, add a print statement before each\n", + "return statement and display the return value. If\n", + "possible, check the result by hand. Consider calling the\n", + "function with values that make it easy to check the result\n", + "(as in Section 6.2)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the function seems to be working, look at the function call\n", + "to make sure the return value is being used correctly (or used\n", + "at all!).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding print statements at the beginning and end of a function\n", + "can help make the flow of execution more visible.\n", + "For example, here is a version of factorial with\n", + "print statements:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "def factorial(n):\n", + " space = ' ' * (4 * n)\n", + " print(space, 'factorial', n)\n", + " if n == 0:\n", + " print(space, 'returning 1')\n", + " return 1\n", + " else:\n", + " recurse = factorial(n-1)\n", + " result = n * recurse\n", + " print(space, 'returning', result)\n", + " return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "space is a string of space characters that controls the\n", + "indentation of the output. 
Here is the result of factorial(4) :" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " factorial 4\n", + " factorial 3\n", + " factorial 2\n", + " factorial 1\n", + " factorial 0\n", + " returning 1\n", + " returning 1\n", + " returning 2\n", + " returning 6\n", + " returning 24\n" + ] + }, + { + "data": { + "text/plain": [ + "24" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "factorial(4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you are confused about the flow of execution, this kind of\n", + "output can be helpful. It takes some time to develop effective\n", + "scaffolding, but a little bit of scaffolding can save a lot of debugging." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6.10 Glossary" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_10__Lists.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_10__Lists.ipynb new file mode 100644 index 0000000..3e4ea07 --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_10__Lists.ipynb @@ -0,0 +1,1693 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 10  Lists\n", + "\n", + "http://greenteapress.com/thinkpython2/html/thinkpython2011.html\n", + "\n", + "* A list is a sequence\n", + "* Lists are mutable\n", + "* Traversing a list\n", + "* List operations\n", + "* List 
slices\n", + "* List methods\n", + "* Map, filter and reduce\n", + "* Deleting elements\n", + "* Lists and strings\n", + "* Objects and values\n", + "* Aliasing\n", + "* List arguments\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.1 A list is a sequence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This chapter presents one of Python’s most useful built-in types, lists.\n", + "You will also learn more about objects and what can happen when you have\n", + "more than one name for the same object." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Like a string, a list is a sequence of values. In a string, the\n", + "values are characters; in a list, they can be any type. The values in\n", + "a list are called elements or sometimes items.\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are several ways to create a new list; the simplest is to\n", + "enclose the elements in square brackets ([ and ]):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[10, 20, 30, 40]\n", + "['crunchy frog', 'ram bladder', 'lark vomit']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first example is a list of four integers. The second is a list of\n", + "three strings. The elements of a list don’t have to be the same type.\n", + "The following list contains a string, a float, an integer, and\n", + "(lo!) 
another list:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "['spam', 2.0, 5, [10, 20]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A list within another list is nested.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A list that contains no elements is\n", + "called an empty list; you can create one with empty\n", + "brackets, [].\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you might expect, you can assign list values to variables:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cheeses = ['Cheddar', 'Edam', 'Gouda']\n", + "numbers = [42, 123]\n", + "empty = []\n", + "print(cheeses, numbers, empty)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.2 Lists are mutable" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The syntax for accessing the elements of a list is the same as for\n", + "accessing the characters of a string—the bracket operator. The\n", + "expression inside the brackets specifies the index. Remember that the\n", + "indices start at 0:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cheeses[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Unlike strings, lists are mutable. 
When the bracket operator appears\n", + "on the left side of an assignment, it identifies the element of the\n", + "list that will be assigned.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numbers = [42, 123]\n", + "numbers[1] = 5\n", + "numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numbers[4] = 5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The one-eth element of numbers, which\n", + "used to be 123, is now 5.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Figure 10.1 shows \n", + "the state diagram for cheeses, numbers and empty:\n", + "\n", + "![](http://greenteapress.com/thinkpython2/html/thinkpython2011.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lists are represented by boxes with the word “list” outside\n", + "and the elements of the list inside. cheeses refers to\n", + "a list with three elements indexed 0, 1 and 2.\n", + "numbers contains two elements; the diagram shows that the\n", + "value of the second element has been reassigned from 123 to 5.\n", + "empty refers to a list with no elements.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "List indices work the same way as string indices:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The in operator also works on lists." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cheeses = ['Cheddar', 'Edam', 'Gouda']\n", + "'Edam' in cheeses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'Brie' in cheeses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.3 Traversing a list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The most common way to traverse the elements of a list is\n", + "with a for loop. The syntax is the same as for strings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for cheese in cheeses:\n", + "    print(cheese)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This works well if you only need to read the elements of the\n", + "list. But if you want to write or update the elements, you\n", + "need the indices. A common way to do that is to combine\n", + "the built-in functions range and len:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(len(numbers)):\n", + "    numbers[i] = numbers[i] * 2\n", + "numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i, e in enumerate(numbers):\n", + "    print(i, e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The loop that uses range and len traverses the list and updates each element. len\n", + "returns the number of elements in the list. range produces\n", + "the sequence of indices from 0 to n−1, where n is the length of\n", + "the list. Each time through the loop i gets the index\n", + "of the next element.
The assignment statement in the body uses\n", + "i to read the old value of the element and to assign the\n", + "new value.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A for loop over an empty list never runs the body:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for x in []:\n", + "    print('This never happens.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Although a list can contain another list, the nested\n", + "list still counts as a single element. The length of this list is\n", + "four:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "['spam', 1, ['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: flattening a list of lists with a list comprehension\n", + "\n", + "https://docs.python.org/3.0/tutorial/datastructures.html?highlight=list%20comprehension" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "my_list = [['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[element for sublist in my_list for element in sublist]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "my_list = ['spam', 1, ['Brie', 'Roquefort', 'Pol le Veq'], [1, 2, 3]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[element for element in my_list if not isinstance(element, list)]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[element for sublist in my_list if isinstance(sublist, list) for element in sublist]" + ] + }, + {
+ "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.4 List operations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The + operator concatenates lists:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3]\n", + "b = [4, 5, 6]\n", + "c = a + b\n", + "c" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The * operator repeats a list a given number of times:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[0] * 4" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "[1, 2, 3] * 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first example repeats [0] four times. The second example\n", + "repeats the list [1, 2, 3] three times." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.5 List slices" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The slice operator also works on lists:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c', 'd', 'e', 'f']\n", + "t[1:3]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[:4]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[3:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you omit the first index, the slice starts at the beginning.\n", + "If you omit the second, the slice goes to the end. 
So if you\n", + "omit both, the slice is a copy of the whole list.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Since lists are mutable, it is often useful to make a copy\n", + "before performing operations that modify lists.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A slice operator on the left side of an assignment\n", + "can update multiple elements:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c', 'd', 'e', 'f']\n", + "t[1:3] = ['x', 'y']\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: can you reverse a list with slicing?\n", + "\n", + "['f', 'e', 'd', 'y', 'x', 'a']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[::-1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.6 List methods" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python provides methods that operate on lists. 
For example,\n", + "append adds a new element to the end of a list:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c']\n", + "t.append('d')\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "extend takes a list as an argument and appends all of\n", + "the elements:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t1 = ['a', 'b', 'c']\n", + "t2 = ['d', 'e']\n", + "t1.extend(t2)\n", + "t1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This example leaves t2 unmodified." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "sort arranges the elements of the list from low to high:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['d', 'c', 'e', 'b', 'a']\n", + "t.sort()\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Most list methods are void; they modify the list and return None.\n", + "If you accidentally write t = t.sort(), you will be disappointed\n", + "with the result.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.7 Map, filter and reduce" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To add up all the numbers in a list, you can use a loop like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def add_all(t):\n", + " total = 0\n", + " for x in t:\n", + " total += x\n", + " return total" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "total is initialized to 0. Each time through the loop,\n", + "x gets one element from the list. 
The += operator\n", + "provides a short way to update a variable. This \n", + "augmented assignment statement,\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + " total += x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "is equivalent to" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + " total = total + x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As the loop runs, total accumulates the sum of the\n", + "elements; a variable used this way is sometimes called an\n", + "accumulator.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Adding up the elements of a list is such a common operation\n", + "that Python provides it as a built-in function, sum:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = [1, 2, 3]\n", + "sum(t)\n", + "6" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "**An operation like this that combines a sequence of elements into\n", + "a single value is sometimes called reduce.**\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes you want to traverse one list while building\n", + "another. For example, the following function takes a list of strings\n", + "and returns a new list that contains capitalized strings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def capitalize_all(t):\n", + " res = []\n", + " for s in t:\n", + " res.append(s.capitalize())\n", + " return res" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "res is initialized with an empty list; each time through\n", + "the loop, we append the next element. 
So res is another\n", + "kind of accumulator.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**An operation like capitalize_all is sometimes called a map because it “maps” a function (in this case the method capitalize) onto each of the elements in a sequence.**\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another common operation is to select some of the elements from\n", + "a list and return a sublist. For example, the following\n", + "function takes a list of strings and returns a list that contains\n", + "only the uppercase strings:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def only_upper(t):\n", + " res = []\n", + " for s in t:\n", + " if s.isupper():\n", + " res.append(s)\n", + " return res" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "isupper is a string method that returns True if\n", + "the string contains only upper case letters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**An operation like only_upper is called a filter because\n", + "it selects some of the elements and filters out the others.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Most common list operations can be expressed as a combination\n", + "of map, filter and reduce." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.8 Deleting elements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are several ways to delete elements from a list. 
If you\n", + "know the index of the element you want, you can use\n", + "pop:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c']\n", + "x = t.pop(1)\n", + "t" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "pop modifies the list and returns the element that was removed.\n", + "If you don’t provide an index, it deletes and returns the\n", + "last element." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you don’t need the removed value, you can use the del\n", + "operator:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c']\n", + "del t[1]\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you know the element you want to remove (but not the index), you\n", + "can use remove:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'b', 'c']\n", + "t.remove('b')\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The return value from remove is None.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To remove more than one element, you can use del with\n", + "a slice index:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['a', 'b', 'c', 'd', 'e', 'f']\n", + "del t[1:5]\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As usual, the slice selects all the elements up to but not\n", + "including the second index." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.9 Lists and strings" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A string is a sequence of characters and a list is a sequence\n", + "of values, but a list of characters is not the same as a\n", + "string. To convert from a string to a list of characters,\n", + "you can use list:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'spam'\n", + "t = list(s)\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Because list is the name of a built-in function, you should\n", + "avoid using it as a variable name. I also avoid l because\n", + "it looks too much like 1. So that’s why I use t." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The list function breaks a string into individual letters. If\n", + "you want to break a string into words, you can use the split\n", + "method:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'pining for the fjords'\n", + "t = s.split()\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "An optional argument called a delimiter specifies which\n", + "characters to use as word boundaries.\n", + "The following example\n", + "uses a hyphen as a delimiter:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'spam-spam-spam'\n", + "delimiter = '-'\n", + "t = s.split(delimiter)\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "join is the inverse of split. It\n", + "takes a list of strings and\n", + "concatenates the elements. 
join is a string method,\n", + "so you have to invoke it on the delimiter and pass the\n", + "list as a parameter:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ['pining', 'for', 'the', 'fjords']\n", + "delimiter = ' '\n", + "s = delimiter.join(t)\n", + "s" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In this case the delimiter is a space character, so\n", + "join puts a space between words. To concatenate\n", + "strings without spaces, you can use the empty string,\n", + "'', as a delimiter. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.10 Objects and values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we run these assignment statements:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = 'banana'\n", + "b = 'banana'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We know that a and b both refer to a\n", + "string, but we don’t\n", + "know whether they refer to the same string.\n", + "There are two possible states, shown in Figure 10.2.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In one case, a and b refer to two different objects that\n", + "have the same value. In the second case, they refer to the same\n", + "object.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To check whether two variables refer to the same object, you can\n", + "use the is operator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = 'banana'\n", + "b = 'banana'\n", + "a is b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In this example, Python only created one string object, and both a and b refer to it. 
But when you create two lists, you get\n", + "two objects:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3]\n", + "b = [1, 2, 3]\n", + "a is b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "So the state diagram looks like Figure 10.3.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case we would say that the two lists are equivalent,\n", + "because they have the same elements, but not identical, because\n", + "they are not the same object. If two objects are identical, they are\n", + "also equivalent, but if they are equivalent, they are not necessarily\n", + "identical.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Until now, we have been using “object” and “value”\n", + "interchangeably, but it is more precise to say that an object has a\n", + "value. If you evaluate [1, 2, 3], you get a list\n", + "object whose value is a sequence of integers. If another\n", + "list has the same elements, we say it has the same value, but\n", + "it is not the same object.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.11 Aliasing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If a refers to an object and you assign b = a,\n", + "then both variables refer to the same object:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = [1, 2, 3]\n", + "b = a\n", + "b is a" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The state diagram looks like Figure 10.4.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The association of a variable with an object is called a reference. 
In this example, there are two references to the same\n", + "object.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An object with more than one reference has more\n", + "than one name, so we say that the object is aliased.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the aliased object is mutable, changes made with one alias affect\n", + "the other:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b[0] = 42\n", + "a" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Although this behavior can be useful, it is error-prone. In general,\n", + "it is safer to avoid aliasing when you are working with mutable\n", + "objects.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For immutable objects like strings, aliasing is not as much of a\n", + "problem. In this example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = 'banana'\n", + "b = 'banana'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "It almost never makes a difference whether a and b refer\n", + "to the same string or not." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.12 List arguments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you pass a list to a function, the function gets a reference to\n", + "the list. If the function modifies the list, the caller sees\n", + "the change. 
For example, delete_head removes the first element\n", + "from a list:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "def delete_head(t):\n", + " del t[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Here’s how it is used:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['b', 'c']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "letters = ['a', 'b', 'c']\n", + "delete_head(letters)\n", + "letters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The parameter t and the variable letters are\n", + "aliases for the same object. The stack diagram looks like\n", + "Figure 10.5.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the list is shared by two frames, I drew\n", + "it between them." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is important to distinguish between operations that\n", + "modify lists and operations that create new lists. For\n", + "example, the append method modifies a list, but the\n", + "+ operator creates a new list.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s an example using append:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t1 = [1, 2]\n", + "t2 = t1.append(3)\n", + "t1" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "t2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The return value from append is None." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s an example using the + operator:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t3 = t1 + [4]\n", + "t1" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3, 4]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result of the operator is a new list, and the original list is\n", + "unchanged." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This difference is important when you write functions that\n", + "are supposed to modify lists. For example, this function\n", + "does not delete the head of a list:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "def bad_delete_head(t):\n", + " t = t[1:] # WRONG!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The slice operator creates a new list and the assignment\n", + "makes t refer to it, but that doesn’t affect the caller.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t4 = [1, 2, 3]\n", + "bad_delete_head(t4)\n", + "t4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "At the beginning of bad_delete_head, t and t4\n", + "refer to the same list. 
At the end, t refers to a new list,\n", + "but t4 still refers to the original, unmodified list." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An alternative is to write a function that creates and\n", + "returns a new list. For\n", + "example, tail returns all but the first\n", + "element of a list:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "def tail(t):\n", + " return t[1:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This function leaves the original list unmodified.\n", + "Here’s how it is used:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['b', 'c']" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "letters = ['a', 'b', 'c']\n", + "rest = tail(letters)\n", + "rest\n", + "['b', 'c']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_11__Dictionaries.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_11__Dictionaries.ipynb new file mode 100644 index 0000000..da79412 --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_11__Dictionaries.ipynb @@ -0,0 +1,2168 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 11  Dictionaries" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "http://greenteapress.com/thinkpython2/html/thinkpython2012.html\n", + "\n", + "* 11.1  A dictionary is a mapping\n", + "* 11.2  Dictionary as a collection of counters\n", + "* 11.3  Looping and dictionaries\n", + "* 11.4  Reverse lookup\n", + "* 11.5  Dictionaries and lists\n", + "* 11.6  Memos\n", + "* 11.7  Global variables\n", + "* 11.8  Debugging\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Python: List vs Tuple vs Dictionary vs Set](https://blog.softhints.com/python-list-vs-tuple-vs-dictionary-vs-set/)\n", + "\n", + "![](https://blog.softhints.com/content/images/size/w2000/2020/04/python_dict_vs_list_vs_tuple_vs_set.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.1 A dictionary is a mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This chapter presents another built-in type called a dictionary.\n", + "Dictionaries are one of Python’s best features; they are the\n", + "building blocks of many efficient and elegant algorithms." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "A dictionary is like a list, but more general. In a list,\n", + "the indices have to be integers; in a dictionary they can\n", + "be (almost) any type." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A dictionary contains a collection of indices, which are called keys, and a collection of values. Each key is associated with a\n", + "single value. **The association of a key and a value is called a key-value pair** or sometimes an item. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In mathematical language, a dictionary represents a mapping\n", + "from keys to values, so you can also say that each key\n", + "“maps to” a value.\n", + "As an example, we’ll build a dictionary that maps from English\n", + "to Spanish words, so the keys and the values are all strings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The function dict creates a new dictionary with no items.\n", + "Because dict is the name of a built-in function, you\n", + "should avoid using it as a variable name.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{}" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp = dict()\n", + "eng2sp" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{}" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp = {}\n", + "eng2sp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The squiggly-brackets, {}, represent an empty dictionary.\n", + "To add items to the dictionary, you can use square brackets:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "eng2sp['one'] = 'uno'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This line creates an item that maps from the key\n", + "'one' to the value 'uno'. 
If we print the\n", + "dictionary again, we see a key-value pair with a colon\n", + "between the key and value:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'one': 'uno'}" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "eng2sp['one'] = '1'" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'one': '1'}" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This output format is also an input format. For example,\n", + "you can create a new dictionary with three items:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "eng2sp = {'one': 'uno',\n", + " 'two': 'dos', \n", + " 'three': 'tres'\n", + " }" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But if you print eng2sp, you might be surprised:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'one': 'two', 'two': 'dos', 'three': 'tres'}" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The order of the key-value pairs might not be the same. Before\n", + "Python 3.7 the order is not guaranteed, so the same example might give\n", + "a different result on your computer; since Python 3.7, dictionaries\n", + "preserve insertion order. Either way, do not rely on an item's position."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But that’s not a problem because\n", + "the elements of a dictionary are never indexed with integer indices.\n", + "Instead, you use the keys to look up the corresponding values:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'dos'" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eng2sp['two']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The key 'two' always maps to the value 'dos' so the order\n", + "of the items doesn’t matter." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the key isn’t in the dictionary, you get an exception:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "ename": "KeyError", + "evalue": "'four'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0meng2sp\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'four'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m: 'four'" + ] + } + ], + "source": [ + "eng2sp['four']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The len function works on dictionaries; it returns the\n", + "number of key-value pairs:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(eng2sp)" + ] + }, + { 
+ "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The in operator works on dictionaries, too; it tells you whether\n", + "something appears as a key in the dictionary (appearing\n", + "as a value is not good enough).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'one' in eng2sp" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'uno' in eng2sp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To see whether something appears as a value in a dictionary, you\n", + "can use the method values, which returns a collection of\n", + "values, and then use the in operator:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vals = eng2sp.values()\n", + "'uno' in vals" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "dict_values(['uno', 'dos', 'tres'])\n", + "dict_keys(['one', 'two', 'three'])\n", + "dict_items([('one', 'uno'), ('two', 'dos'), ('three', 'tres')])\n" + ] + } + ], + "source": [ + "print(eng2sp.values())\n", + "print(eng2sp.keys())\n", + "print(eng2sp.items())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The in operator uses different algorithms for lists and\n", + "dictionaries. 
For lists, it searches the elements of the list in\n", + "order, as in Section 8.6. As the list gets longer, the search\n", + "time gets longer in direct proportion." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python dictionaries use a data structure\n", + "called a hashtable that has a remarkable property: the\n", + "in operator takes about the same amount of time no matter how\n", + "many items are in the dictionary. I explain how that’s possible\n", + "in Section B.4, but the explanation might not make\n", + "sense until you’ve read a few more chapters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Bonus**:\n", + "\n", + "* [Hash function](https://en.wikipedia.org/wiki/Hash_function)\n", + "* [Hash table](https://en.wikipedia.org/wiki/Hash_table)\n", + "* [Collision (computer science)](https://en.wikipedia.org/wiki/Collision_(computer_science))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.2 Dictionary as a collection of counters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Suppose you are given a string and you want to count how many\n", + "times each letter appears. There are several ways you could do it:\n", + "\n", + "1. You could create 26 variables, one for each letter of the alphabet. Then you could traverse the string and, for each character, increment the corresponding counter, probably using a chained conditional.\n", + "\n", + "2. You could create a list with 26 elements. Then you could convert each character to a number (using the built-in function ord), use the number as an index into the list, and increment the appropriate counter.\n", + "\n", + "3. You could create a dictionary with characters as keys and counters as the corresponding values. The first time you see a character, you would add an item to the dictionary. After that you would increment the value of an existing item." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each of these options performs the same computation, but each\n", + "of them implements that computation in a different way.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An implementation is a way of performing a computation;\n", + "some implementations are better than others. For example,\n", + "an advantage of the dictionary implementation is that we don’t\n", + "have to know ahead of time which letters appear in the string\n", + "and we only have to make room for the letters that do appear." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is what the code might look like:" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [], + "source": [ + "def histogram(s):\n", + "    d = dict()\n", + "    for c in s:\n", + "        if c not in d:\n", + "            d[c] = 1\n", + "        else:\n", + "            d[c] += 1\n", + "    return d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The name of the function is histogram, which is a statistical\n", + "term for a collection of counters (or frequencies).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The first line of the\n", + "function creates an empty dictionary. The for loop traverses\n", + "the string. Each time through the loop, if the character c is\n", + "not in the dictionary, we create a new item with key c and the\n", + "initial value 1 (since we have seen this letter once). 
If c is\n", + "already in the dictionary we increment d[c].\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s how it works:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h = histogram('brontosaurus')\n", + "h" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The histogram indicates that the letters 'a' and 'b'\n", + "appear once; 'o' appears twice, and so on." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "Dictionaries have a method called get that takes a key\n", + "and a default value. If the key appears in the dictionary,\n", + "get returns the corresponding value; otherwise it returns\n", + "the default value. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'a': 1}" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h = histogram('a')\n", + "h" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h.get('a', 0)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h.get('c', 0)" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "ename": "KeyError", + "evalue": "'c'", + "output_type": "error", + "traceback": [ 
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mh\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'c'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mKeyError\u001b[0m: 'c'" + ] + } + ], + "source": [ + "h['c']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As an exercise, use get to write histogram more concisely. You\n", + "should be able to eliminate the if statement." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [], + "source": [ + "def histogram(s):\n", + "    d = dict()\n", + "    for c in s:\n", + "        d[c] = d.get(c, 0) + 1\n", + "    return d" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h = histogram('brontosaurus')\n", + "h" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.3 Looping and dictionaries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you use a dictionary in a for statement, it traverses\n", + "the keys of the dictionary. 
For example, print_hist\n", + "prints each key and the corresponding value:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [], + "source": [ + "def print_hist(h):\n", + "    for c in h:\n", + "        print(c, h[c])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Here’s what the output looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "p 1\n", + "a 1\n", + "r 2\n", + "o 1\n", + "t 1\n" + ] + } + ], + "source": [ + "h = histogram('parrot')\n", + "print_hist(h)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The keys appear in insertion order (guaranteed since Python 3.7),\n", + "not in sorted order. To traverse the keys\n", + "in sorted order, you can use the built-in function sorted:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "a 1\n", + "o 1\n", + "p 1\n", + "r 2\n", + "t 1\n" + ] + } + ], + "source": [ + "for key in sorted(h):\n", + "    print(key, h[key])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Bonus: Getting all keys and values" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "p 1\n", + "a 1\n", + "r 2\n", + "o 1\n", + "t 1\n" + ] + } + ], + "source": [ + "for k, v in h.items():\n", + "    print(k, v)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.4 Reverse lookup" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Given a dictionary d and a key k, it is easy to\n", + "find the corresponding value v = d[k]. This operation\n", + "is called a lookup." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But what if you have v and you want to find k?\n", + "You have two problems: first, there might be more than one\n", + "key that maps to the value v. Depending on the application,\n", + "you might be able to pick one, or you might have to make\n", + "a list that contains all of them. Second, there is no\n", + "simple syntax to do a reverse lookup; you have to search." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is a function that takes a value and returns the first\n", + "key that maps to that value:" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [], + "source": [ + "def reverse_lookup(d, v):\n", + "    for k in d:\n", + "        if d[k] == v:\n", + "            return k\n", + "    raise LookupError()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This function is yet another example of the search pattern, but it\n", + "uses a feature we haven’t seen before, raise. The\n", + "raise statement causes an exception; in this case it causes a\n", + "LookupError, which is a built-in exception used to indicate\n", + "that a lookup operation failed.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we get to the end of the loop, that means v\n", + "doesn’t appear in the dictionary as a value, so we raise an\n", + "exception." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a successful reverse lookup:" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'r'" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "h = histogram('parrot')\n", + "key = reverse_lookup(h, 2)\n", + "key" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'p': 1, 'a': 1, 'r': 2, 'o': 1, 't': 1}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "histogram('parrot')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "And an unsuccessful one:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "ename": "LookupError", + "evalue": "", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mLookupError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mreverse_lookup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mreverse_lookup\u001b[0;34m(d, v)\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0md\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mk\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mLookupError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mLookupError\u001b[0m: " + ] + } + ], + "source": [ + "key = reverse_lookup(h, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The effect when you raise an exception is the same as when\n", + "Python raises one: it prints a traceback and an error message.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you raise an exception, you can provide a detailed error message as an optional argument. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "ename": "LookupError", + "evalue": "value does not appear in the dictionary", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mLookupError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mLookupError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'value does not appear in the dictionary'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mreverse_lookup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mreverse_lookup\u001b[0;34m(d, v)\u001b[0m\n\u001b[1;32m 3\u001b[0m 
\u001b[0;32mif\u001b[0m \u001b[0md\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mv\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mk\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mLookupError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'value does not appear in the dictionary'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mreverse_lookup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mLookupError\u001b[0m: value does not appear in the dictionary" + ] + } + ], + "source": [ + "def reverse_lookup(d, v):\n", + "    for k in d:\n", + "        if d[k] == v:\n", + "            return k\n", + "    raise LookupError('value does not appear in the dictionary')\n", + "key = reverse_lookup(h, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raise LookupError('value does not appear in the dictionary')\n", + "# Traceback (most recent call last):\n", + "#   File \"\", line 1, in ?\n", + "# LookupError: value does not appear in the dictionary" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A reverse lookup is much slower than a forward lookup; if you\n", + "have to do it often, or if the dictionary gets big, the performance\n", + "of your program will suffer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.5 Dictionaries and lists" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lists can appear as values in a dictionary. 
For example, if you\n", + "are given a dictionary that maps from letters to frequencies, you\n", + "might want to invert it; that is, create a dictionary that maps\n", + "from frequencies to letters. Since there might be several letters\n", + "with the same frequency, each value in the inverted dictionary\n", + "should be a list of letters.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is a function that inverts a dictionary:" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "def invert_dict(d):\n", + "    inverse = dict()\n", + "    for key in d:\n", + "        val = d[key]\n", + "        if val not in inverse:\n", + "            inverse[val] = [key]\n", + "        else:\n", + "            inverse[val].append(key)\n", + "    return inverse" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Each time through the loop, key gets a key from d and \n", + "val gets the corresponding value. If val is not in inverse, that means we haven’t seen it before, so we create a new\n", + "item and initialize it with a singleton (a list that contains a\n", + "single element). Otherwise we have seen this value before, so we\n", + "append the corresponding key to the list. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'p': 1, 'a': 1, 'r': 2, 'o': 1, 't': 1}" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hist = histogram('parrot')\n", + "hist" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{1: ['p', 'a', 'o', 't'], 2: ['r']}" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "inverse = invert_dict(hist)\n", + "inverse" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Figure 11.1 is a state diagram showing hist and inverse.\n", + "A dictionary is represented as a box with the type dict above it\n", + "and the key-value pairs inside. If the values are integers, floats or\n", + "strings, I draw them inside the box, but I usually draw lists\n", + "outside the box, just to keep the diagram simple.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lists can be values in a dictionary, as this example shows, but they\n", + "cannot be keys. 
Here’s what happens if you try:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "ename": "TypeError", + "evalue": "unhashable type: 'list'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mt\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0md\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0md\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mt\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'oops'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'" + ] + } + ], + "source": [ + "t = [1, 2, 3]\n", + "d = dict()\n", + "d[t] = 'oops'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "I mentioned earlier that a dictionary is implemented using\n", + "a hashtable and that means that the keys have to be hashable.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A hash is a function that takes a value (of any kind)\n", + "and returns an integer. Dictionaries use these integers,\n", + "called hash values, to store and look up key-value pairs.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This system works fine if the keys are immutable. But if the\n", + "keys are mutable, like lists, bad things happen. 
For example,\n", + "when you create a key-value pair, Python hashes the key and \n", + "stores it in the corresponding location. If you modify the\n", + "key and then hash it again, it would go to a different location.\n", + "In that case you might have two entries for the same key,\n", + "or you might not be able to find a key. Either way, the\n", + "dictionary wouldn’t work correctly." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That’s why keys have to be hashable, and why mutable types like\n", + "lists aren’t. The simplest way to get around this limitation is to\n", + "use tuples, which we will see in the next chapter." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since dictionaries are mutable, they can’t be used as keys,\n", + "but they can be used as values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.6 Memos" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you played with the fibonacci function from\n", + "Section 6.7, you might have noticed that the bigger\n", + "the argument you provide, the longer the function takes to run.\n", + "Furthermore, the run time increases quickly.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To understand why, consider Figure 11.2, which shows\n", + "the call graph for fibonacci with n=4:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A call graph shows a set of function frames, with lines connecting each\n", + "frame to the frames of the functions it calls. At the top of the\n", + "graph, fibonacci with n=4 calls fibonacci with n=3 and n=2. In turn, fibonacci with n=3 calls\n", + "fibonacci with n=2 and n=1. And so on.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Count how many times fibonacci(0) and fibonacci(1) are\n", + "called. 
This is an inefficient solution to the problem, and it gets\n", + "worse as the argument gets bigger.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One solution is to keep track of values that have already been\n", + "computed by storing them in a dictionary. A previously computed value\n", + "that is stored for later use is called a memo. Here is a\n", + "“memoized” version of fibonacci:" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [], + "source": [ + "known = {0:0, 1:1}\n", + "\n", + "def fibonacci(n):\n", + "    if n in known:\n", + "        return known[n]\n", + "\n", + "    res = fibonacci(n-1) + fibonacci(n-2)\n", + "    known[n] = res\n", + "    return res" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "known is a dictionary that keeps track of the Fibonacci\n", + "numbers we already know. It starts with\n", + "two items: 0 maps to 0 and 1 maps to 1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Whenever fibonacci is called, it checks known.\n", + "If the result is already there, it can return\n", + "immediately. Otherwise it has to \n", + "compute the new value, add it to the dictionary, and return it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you run this version of fibonacci and compare it with\n", + "the original, you will find that it is much faster." + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The slowest run took 214.19 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", + "10000000 loops, best of 3: 109 ns per loop\n" + ] + } + ], + "source": [ + "%timeit fibonacci(20)" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [], + "source": [ + "def fibonacci(n):\n", + "    if n < 2:\n", + "        return n\n", + "    res = fibonacci(n-1) + fibonacci(n-2)\n", + "    return res" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100 loops, best of 3: 3.59 ms per loop\n" + ] + } + ], + "source": [ + "%timeit fibonacci(20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Bonus \n", + "\n", + "* [Dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming)\n", + "* [19. Dynamic Programming I: Fibonacci, Shortest Paths](https://www.youtube.com/watch?v=OQ5jsbhAv_M)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.7 Global variables" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the previous example, known is created outside the function,\n", + "so it belongs to the special frame called __main__.\n", + "Variables in __main__ are sometimes called global\n", + "because they can be accessed from any function. Unlike local\n", + "variables, which disappear when their function ends, global variables\n", + "persist from one function call to the next.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is common to use global variables for flags; that is, \n", + "boolean variables that indicate (“flag”) whether a condition\n", + "is true. 
For example, some programs use\n", + "a flag named verbose to control the level of detail in the\n", + "output:" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running example1\n" + ] + } + ], + "source": [ + "verbose = True\n", + "\n", + "def example1():\n", + "    if verbose:\n", + "        print('Running example1')\n", + "example1()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you try to reassign a global variable, you might be surprised.\n", + "The following example is supposed to keep track of whether the\n", + "function has been called:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "been_called = False\n", + "\n", + "def example2():\n", + "    been_called = True # WRONG\n", + "\n", + "example2()\n", + "been_called" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But if you run it you will see that the value of been_called\n", + "doesn’t change. The problem is that example2 creates a new local\n", + "variable named been_called. 
The local variable goes away when\n", + "the function ends, and has no effect on the global variable.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To reassign a global variable inside a function you have to\n", + "declare the global variable before you use it:" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "been_called = False\n", + "\n", + "def example2():\n", + "    global been_called\n", + "    been_called = True\n", + "\n", + "example2()\n", + "\n", + "been_called" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Pause the video: can you explain why `been_called` is False?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The global statement tells the interpreter\n", + "something like, “In this function, when I say been_called, I\n", + "mean the global variable; don’t create a local one.”\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s an example that tries to update a global variable:" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [ + { + "ename": "UnboundLocalError", + "evalue": "local variable 'count' referenced before assignment", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mUnboundLocalError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mcount\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcount\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m \u001b[0;31m# WRONG\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mexample3\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36mexample3\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mexample3\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mcount\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcount\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m \u001b[0;31m# WRONG\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mexample3\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mUnboundLocalError\u001b[0m: local variable 'count' referenced before assignment" + ] + } + ], + "source": [ + "count = 0\n", + "\n", + "def example3():\n", + " count = count + 1 # WRONG\n", + " \n", + "example3()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Python assumes that count is local, and under that assumption\n", + "you are reading it before writing it. 
The solution, again,\n", + "is to declare count global.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [], + "source": [ + "def example3():\n", + "    global count\n", + "    count += 1\n", + "example3()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If a global variable refers to a mutable value, you can modify\n", + "the value without declaring the variable:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 0, 1: 1, 2: 1}" + ] + }, + "execution_count": 79, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "known = {0:0, 1:1}\n", + "\n", + "def example4():\n", + "    known[2] = 1\n", + "example4()\n", + "known" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "So you can add, remove and replace elements of a global list or\n", + "dictionary, but if you want to reassign the variable, you\n", + "have to declare it:" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [], + "source": [ + "def example5():\n", + "    global known\n", + "    known = dict()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Global variables can be useful, but if you have a lot of them,\n", + "and you modify them frequently, they can make programs\n", + "hard to debug." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.8 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you work with bigger datasets it can become unwieldy to\n", + "debug by printing and checking the output by hand. Here are some\n", + "suggestions for debugging large datasets:\n", + "\n", + "**1. Scale down the input:**\n", + "If possible, reduce the size of the dataset. 
For example if the program reads a text file, start with just the first 10 lines, or with the smallest example you can find. You can either edit the files themselves, or (better) modify the program so it reads only the first n lines.\n", + "If there is an error, you can reduce n to the smallest value that manifests the error, and then increase it gradually as you find and correct errors.\n", + "\n", + "**2. Check summaries and types:**\n", + "Instead of printing and checking the entire dataset, consider printing summaries of the data: for example, the number of items in a dictionary or the total of a list of numbers.\n", + "A common cause of runtime errors is a value that is not the right type. For debugging this kind of error, it is often enough to print the type of a value.\n", + "\n", + "**3. Write self-checks:**\n", + "Sometimes you can write code to check for errors automatically. For example, if you are computing the average of a list of numbers, you could check that the result is not greater than the largest element in the list or less than the smallest. This is called a “sanity check” because it detects results that are “insane”.\n", + "Another kind of check compares the results of two different computations to see if they are consistent. This is called a “consistency check”.\n", + "\n", + "**4. Format the output:**\n", + "Formatting debugging output can make it easier to spot an error. We saw an example in Section 6.9. Another tool you might find useful is the pprint module, which provides a pprint function that displays built-in types in a more human-readable format (pprint stands for “pretty print”)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, time you spend building scaffolding can reduce\n", + "the time you spend debugging.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(5043, 28)" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of df.head() omitted: 5 rows × 28 columns, identical to the text/plain output below]
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "2 Color Sam Mendes 602.0 148.0 \n", + "3 Color Christopher Nolan 813.0 164.0 \n", + "4 NaN Doug Walker NaN NaN \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "2 0.0 161.0 Rory Kinnear \n", + "3 22000.0 23000.0 Christian Bale \n", + "4 131.0 NaN Rob Walker \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n", + "2 11000.0 200074175.0 Action|Adventure|Thriller ... \n", + "3 27000.0 448130642.0 Action|Thriller ... \n", + "4 131.0 NaN Documentary ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "0 3054.0 English USA PG-13 237000000.0 \n", + "1 1238.0 English USA PG-13 300000000.0 \n", + "2 994.0 English UK PG-13 245000000.0 \n", + "3 2701.0 English USA PG-13 250000000.0 \n", + "4 NaN NaN NaN NaN NaN \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "0 2009.0 936.0 7.9 1.78 \n", + "1 2007.0 5000.0 7.1 2.35 \n", + "2 2015.0 393.0 6.8 2.35 \n", + "3 2012.0 23000.0 8.5 2.35 \n", + "4 NaN 12.0 7.1 NaN \n", + "\n", + " movie_facebook_likes \n", + "0 33000 \n", + "1 0 \n", + "2 85000 \n", + "3 164000 \n", + "4 0 \n", + "\n", + "[5 rows x 28 columns]" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3390669.0" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sum(df['director_facebook_likes'].fillna(0))" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 inf\n", + "1 0.300178\n", + "2 inf\n", + "3 0.007455\n", + "4 NaN\n", + " ... \n", + "5038 43.500000\n", + "5039 NaN\n", + "5040 inf\n", + "5041 inf\n", + "5042 5.625000\n", + "Length: 5043, dtype: float64" + ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['duration'] / df['director_facebook_likes']" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.0 907\n", + "NaN 104\n", + "3.0 70\n", + "6.0 66\n", + "7.0 64\n", + " ... \n", + "104.0 1\n", + "224.0 1\n", + "220.0 1\n", + "522.0 1\n", + "764.0 1\n", + "Name: director_facebook_likes, Length: 436, dtype: int64" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['director_facebook_likes'].value_counts(dropna=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11.9 Glossary" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_12__Tuples.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_12__Tuples.ipynb new file mode 100644 index 0000000..fc2e3da --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_12__Tuples.ipynb @@ -0,0 +1,1497 @@ +{ + "cells": [ + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "# Chapter 12  Tuples\n", + "\n", + "\n", + "* 12.1  Tuples are immutable\n", + "* 12.2  Tuple assignment\n", + "* 12.3  Tuples as return values\n", + "* 12.4  Variable-length argument tuples\n", + "* 12.5  Lists and tuples\n", + "* 12.6  Dictionaries and tuples\n", + "* 12.7  Sequences of sequences" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.1 Tuples are immutable" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This chapter presents one more built-in type, the tuple, and then\n", + "shows how lists, dictionaries, and tuples work together.\n", + "I also present a useful feature for variable-length argument lists,\n", + "the gather and scatter operators." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One note: there is no consensus on how to pronounce “tuple”. Some people say **“tuh-ple”**, which rhymes with “supple”. But in the context of programming, most people say **“too-ple”**, which rhymes with “quadruple”." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " A tuple is a sequence of values. The values can be any type, and\n", + "they are indexed by integers, so in that respect tuples are a lot\n", + "like lists. The important difference is that tuples are immutable.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Syntactically, a tuple is a comma-separated list of values:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = 'a', 'b', 'c', 'd', 'e'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Although it is not necessary, it is common to enclose tuples in\n", + "parentheses:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ('a', 'b', 'c', 'd', 'e')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To create a tuple with a single element, you have to include a final\n", + "comma:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t1 = 'a',\n", + "type(t1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A value in parentheses is not a tuple:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t2 = ('a')\n", + "type(t2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Another way to create a tuple is the built-in function tuple.\n", + "With no argument, it creates an empty tuple:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = tuple()\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the argument is a sequence (string, list or tuple), the result\n", + "is a tuple with the elements of the sequence:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = tuple('lupins')\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Because tuple is the name of a built-in 
function, you should\n", + "avoid using it as a variable name." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Most list operators also work on tuples. The bracket operator\n", + "indexes an element:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ('a', 'b', 'c', 'd', 'e')\n", + "t[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "And the slice operator selects a range of elements.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[1:3]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " But if you try to modify one of the elements of the tuple, you get\n", + "an error:\n", + "
\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t[0] = 'A'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Because tuples are immutable, you can’t modify the elements. But you\n", + "can replace one tuple with another:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = ('A',) + t[1:]\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This statement makes a new tuple and then makes t refer to it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The relational operators work with tuples and other sequences;\n", + "Python starts by comparing the first element from each\n", + "sequence. If they are equal, it goes on to the next elements,\n", + "and so on, until it finds elements that differ. Subsequent\n", + "elements are not considered (even if they are really big).\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(0, 1, 2) < (0, 3, 4)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "(0, 1, 2000000) < (0, 3, 4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.2 Tuple assignment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often useful to swap the values of two variables.\n", + "With conventional assignments, you have to use a temporary\n", + "variable. 
For example, to swap a and b:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a = 4\n", + "b = 3\n", + "print(f'a: {a}, b: {b}')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp = a\n", + "a = b\n", + "b = temp" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f'a: {a}, b: {b}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "

Bonus: Tower of Hanoi

\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This solution is cumbersome; tuple assignment is more elegant:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a, b = b, a" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The left side is a tuple of variables; the right side is a tuple of\n", + "expressions. Each value is assigned to its respective variable. \n", + "All the expressions on the right side are evaluated before any\n", + "of the assignments." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
\n", + " The number of variables on the left and the number of\n", + "values on the right have to be the same:\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "a, b = 1, 2, 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "More generally, the right side can be any kind of sequence\n", + "(string, list or tuple). For example, to split an email address\n", + "into a user name and a domain, you could write:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "addr = 'monty@python.org'\n", + "uname, domain = addr.split('@')\n", + "uname, domain" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = ['Everest', 8849, 27.9881, 86.9250]\n", + "name, height, latitude, longitude = data\n", + "\n", + "print(name, height, latitude, longitude)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The return value from split is a list with two elements;\n", + "the first element is assigned to uname, the second to\n", + "domain." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "uname #'monty'\n", + "domain #'python.org'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.3 Tuples as return values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Strictly speaking, a function can only return one value, but\n", + "if the value is a tuple, the effect is the same as returning\n", + "multiple values. For example, if you want to divide two integers\n", + "and compute the quotient and remainder, it is inefficient to\n", + "compute x//y and then x%y. 
It is better to compute\n", + "them both at the same time.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The built-in function divmod takes two arguments and\n", + "returns a tuple of two values, the quotient and remainder.\n", + "You can store the result as a tuple:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = divmod(7, 3)\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Or use tuple assignment to store the elements separately:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "quot, rem = divmod(7, 3)\n", + "quot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rem" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Here is an example of a function that returns a tuple:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def min_max(t):\n", + " return min(t), max(t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "max and min are built-in functions that find\n", + "the largest and smallest elements of a sequence. min_max\n", + "computes both and returns a tuple of two values.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.4 Variable-length argument tuples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Functions can take a variable number of arguments. A parameter\n", + "name that begins with * gathers arguments into\n", + "a tuple. 
For example, printall\n", + "takes any number of arguments and prints them:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def printall(*args):\n", + " print(args)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The gather parameter can have any name you like, but args is\n", + "conventional. Here’s how the function works:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "printall(1, 2.0, '3','x')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "
\n", + " The complement of gather is scatter. If you have a\n", + "sequence of values and you want to pass it to a function\n", + "as multiple arguments, you can use the * operator.\n", + "For example, divmod takes exactly two arguments; it\n", + "doesn’t work with a tuple:\n", + "
\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = (7, 3)\n", + "divmod(t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " But if you scatter the tuple, it works:\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "divmod(*t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Many of the built-in functions use\n", + "variable-length argument tuples. For example, max\n", + "and min can take any number of arguments:\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "max(1, 2, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "But sum does not.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sum(1, 2, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As an exercise, write a function called sum_all that takes any number\n", + "of arguments and returns their sum." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.5 Lists and tuples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "zip is a built-in function that takes two or more sequences and\n", + "interleaves them. The name of the function refers to\n", + "a zipper, which interleaves two rows of teeth." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This example zips a string and a list:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'abc'\n", + "t = [0, 1, 2]\n", + "zip(s, t)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result is a zip object that knows how to iterate through\n", + "the pairs. 
The most common use of zip is in a for loop:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for pair in zip(s, t):\n", + " print(pair)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A zip object is a kind of iterator, which is any object\n", + "that iterates through a sequence. Iterators are similar to lists in some\n", + "ways, but unlike lists, you can’t use an index to select an element from\n", + "an iterator.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to use list operators and methods, you can\n", + "use a zip object to make a list:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "zip(s, t)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(zip(s, t))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result is a list of tuples; in this example, each tuple contains\n", + "a character from the string and the corresponding element from\n", + "the list.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the sequences are not the same length, the result has the\n", + "length of the shorter one." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "list(zip('Anne', 'Elk'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "You can use tuple assignment in a for loop to traverse a list of\n", + "tuples:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = [('a', 0), ('b', 1), ('c', 2)]\n", + "for letter, number in t:\n", + " print(number, letter)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Each time through the loop, Python selects the next tuple in\n", + "the list and assigns the elements to letter and \n", + "number. The output of this loop is:\n", + "
\n", + "\n", + "0 a\n", + "\n", + "1 b\n", + "\n", + "2 c" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you combine zip, for and tuple assignment, you get a\n", + "useful idiom for traversing two (or more) sequences at the same\n", + "time. For example, has_match takes two sequences, t1 and\n", + "t2, and returns True if there is an index i\n", + "such that t1[i] == t2[i]:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def has_match(t1, t2):\n", + "    for x, y in zip(t1, t2):\n", + "        if x == y:\n", + "            return True\n", + "    return False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you need to traverse the elements of a sequence and their\n", + "indices, you can use the built-in function enumerate:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for index, element in enumerate('abc'):\n", + "    print(index, element)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result from enumerate is an enumerate object, which\n", + "iterates over a sequence of pairs; each pair contains an index (starting\n", + "from 0) and an element from the given sequence.\n", + "In this example, the output is:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# 0 a\n", + "# 1 b\n", + "# 2 c" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Again.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.6 Dictionaries and tuples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "Dictionaries have a method called items that returns a sequence of\n", + "tuples, where each tuple is a key-value pair.\n", + "
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "d = {'a':0, 'b':1, 'c':2}\n", + "t = d.items()\n", + "t" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result is a dict_items object, a view that lets you\n", + "iterate over the key-value pairs. You can use it in a for loop\n", + "like this:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for key, value in d.items():\n", + "    print(key, value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In Python 3.7 and later, the items come out in insertion order;\n", + "in earlier versions the order is not guaranteed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "
\n", + "Going in the other direction, you can use a list of tuples to\n", + "initialize a new dictionary: \n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t = [('a', 0), ('c', 2), ('b', 1)]\n", + "d = dict(t)\n", + "d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Combining dict with zip yields a concise way\n", + "to create a dictionary:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "d = dict(zip('abc', range(3)))\n", + "d" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The dictionary method update also takes a list of tuples\n", + "and adds them, as key-value pairs, to an existing dictionary.\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is common to use tuples as keys in dictionaries (primarily because\n", + "you can’t use lists). For example, a telephone directory might map\n", + "from last-name, first-name pairs to telephone numbers. Assuming\n", + "that we have defined last, first and number, we\n", + "could write:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "directory[last, first] = number" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The expression in brackets is a tuple. We could use tuple\n", + "assignment to traverse this dictionary.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for last, first in directory:\n", + " print(first, last, directory[last,first])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This loop traverses the keys in directory, which are tuples. It\n", + "assigns the elements of each tuple to last and first, then\n", + "prints the name and corresponding telephone number." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two ways to represent tuples in a state diagram. The more\n", + "detailed version shows the indices and elements just as they appear in\n", + "a list. For example, the tuple ('Cleese', 'John') would appear\n", + "as in Figure 12.1.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But in a larger diagram you might want to leave out the\n", + "details. For example, a diagram of the telephone directory might\n", + "appear as in Figure 12.2." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here the tuples are shown using Python syntax as a graphical\n", + "shorthand. The telephone number in the diagram is the complaints line\n", + "for the BBC, so please don’t call it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.7 Sequences of sequences" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I have focused on lists of tuples, but almost all of the examples in\n", + "this chapter also work with lists of lists, tuples of tuples, and\n", + "tuples of lists. To avoid enumerating the possible combinations, it\n", + "is sometimes easier to talk about sequences of sequences." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In many contexts, the different kinds of sequences (strings, lists and\n", + "tuples) can be used interchangeably. So how should you choose one\n", + "over the others?\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "To start with the obvious, strings are more limited than other\n", + "sequences because the elements have to be characters. They are\n", + "also immutable. If you need the ability to change the characters\n", + "in a string (as opposed to creating a new string), you might\n", + "want to use a list of characters instead.\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
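\n", + "Lists are more common than tuples, mostly because they are mutable.\n", + "But there are a few cases where you might prefer tuples:\n", + "\n", + "1. In some contexts, like a return statement, it is syntactically simpler to create a tuple than a list.\n", + "2. If you want to use a sequence as a dictionary key, you have to use an immutable type like a tuple or string.\n", + "3. If you are passing a sequence as an argument to a function, using tuples reduces the potential for unexpected behavior due to aliasing.\n", + "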
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "Because tuples are immutable, they don’t provide methods like sort and reverse, which modify existing lists. But Python\n", + "provides the built-in function sorted, which takes any sequence\n", + "and returns a new list with the same elements in sorted order, and\n", + "reversed, which takes a sequence and returns an iterator that\n", + "traverses the list in reverse order.\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.8 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lists, dictionaries and tuples are examples of data\n", + "structures; in this chapter we are starting to see compound data\n", + "structures, like lists of tuples, or dictionaries that contain tuples\n", + "as keys and lists as values. Compound data structures are useful, but\n", + "they are prone to what I call shape errors; that is, errors\n", + "caused when a data structure has the wrong type, size, or structure.\n", + "For example, if you are expecting a list with one integer and I\n", + "give you a plain old integer (not in a list), it won’t work.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To help debug these kinds of errors, I have written a module\n", + "called structshape that provides a function, also called\n", + "structshape, that takes any kind of data structure as\n", + "an argument and returns a string that summarizes its shape.\n", + "You can download it from http://thinkpython2.com/code/structshape.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s the result for a simple list:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from structshape import structshape\n", + "t = [1, 2, 3]\n", + "structshape(t)\n", + "'list of 3 int'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A fancier program might write “list of 3 ints”, but it\n", + "was easier not to deal with plurals. 
Here’s a list of lists:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t2 = [[1,2], [3,4], [5,6]]\n", + "structshape(t2)\n", + "'list of 3 list of 2 int'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the elements of the list are not the same type,\n", + "structshape groups them, in order, by type:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t3 = [1, 2, 3, 4.0, '5', '6', [7], [8], 9]\n", + "structshape(t3)\n", + "'list of (3 int, float, 2 str, 2 list of int, int)'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Here’s a list of tuples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'abc'\n", + "lt = list(zip(t, s))\n", + "structshape(lt)\n", + "'list of 3 tuple of (int, str)'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "And here’s a dictionary with 3 items that map integers to strings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "d = dict(lt) \n", + "structshape(d)\n", + "'dict of 3 int->str'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If you are having trouble keeping track of your data structures,\n", + "structshape can help." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12.9 Glossary" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import ipytracer\n", + "from IPython.display import display\n", + "\n", + "def bubble_sort(unsorted_list):\n", + "    x = ipytracer.ChartTracer(unsorted_list)\n", + "    display(x)\n", + "    length = len(x)-1\n", + "    for i in range(length):\n", + "        for j in range(length-i):\n", + "            if x[j] > x[j+1]:\n", + "                x[j], x[j+1] = x[j+1], x[j]\n", + "    return x.tolist()\n", + "\n", + "bubble_sort([6,4,7,9])" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import ipytracer\n", + "from IPython.display import display\n", + "\n", + "def bubble_sort(unsorted_list):\n", + "    x = ipytracer.List1DTracer(unsorted_list)\n", + "    display(x)\n", + "    length = len(x)-1\n", + "    for i in range(length):\n", + "        for j in range(length-i):\n", + "            if x[j] > x[j+1]:\n", + "                x[j], x[j+1] = x[j+1], x[j]\n", + "    print(unsorted_list)\n", + "    return x.tolist()\n", + "\n", + "bubble_sort([6,4,7,9])" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import ipytracer\n", + "from IPython.display import display\n", + "import re\n", + "\n", + "\n", + "def quick_sort(arr):\n", + "    # Note: despite its name, this uses the built-in sorted() with a natural-sort key, not quicksort.\n", + "    input_list = ipytracer.ChartTracer(arr)\n", + "    display(input_list)\n", + "\n", + "    def alphanum_key(key):\n", + "        return [int(s) if s.isdigit() else s.lower() for s in re.split(\"([0-9]+)\", key)]\n", + "\n", + "    return sorted(input_list, key=alphanum_key)\n", + "\n", + "\n", + "quick_sort(['6','4','7','9','3','5','1','8'])" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "def merge_sort(collectionx: list) -> list:\n", + "    # Tracer demo only: overwrites elements in place; it does not sort or return a list.\n", + "    collectionx = ipytracer.List1DTracer(collectionx)\n", + "    display(collectionx)\n", + "\n", + "    for i in range(0, 8):\n", + "        collectionx[i] = i\n", + "        collectionx[i-1] = i-1\n", + "        collectionx[i-2] = i*2\n", + "\n", + "\n", + "merge_sort([6,4,7,9,3,5,1,8,2])" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def merge_sort(collection: list) -> list:\n", + "\n", + "\n", + "    def merge(left: list, right: list) -> list:\n", + "        \"\"\"merge left and right\n", + "        :param left: left collection\n", + "        :param right: right collection\n", + "        :return: merge result\n", + "        \"\"\"\n", + "\n", + "        def _merge():\n", + "            while left and right:\n", + "                yield (left if left[0] <= right[0] else right).pop(0)\n", + "            yield from left\n", + "            yield from right\n", + "\n", + "        return list(_merge())\n", + "\n", + "    if len(collection) <= 1:\n", + "        return collection\n", + "    mid = len(collection) // 2\n", + "    display(ipytracer.List1DTracer(collection))\n", + "    left = merge_sort(collection[:mid])\n", + "    right = merge_sort(collection[mid:])\n", + "    x = merge(left, right)\n", + "    display(x)\n", + "    # merge() consumes left and right, so return the result already computed\n", + "    return x\n", + "\n", + "merge_sort([6,4,7,9,3,5,1,8,2])" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def shell_sort(collection):\n", + "    collection = ipytracer.List1DTracer(collection)\n", + "    display(collection)\n", + "    gaps = [701, 301, 132, 57, 23, 10, 4, 1]\n", + "\n", + "    for gap in gaps:\n", + "        for i in range(gap, len(collection)):\n", + "            j = i\n", + "            while j >= gap and collection[j] < collection[j - gap]:\n", + "                collection[j], collection[j - gap] = collection[j - gap], collection[j]\n", + "                j -= gap\n", + "    return collection\n", + "\n", + "shell_sort([6,4,7,9,3,5,1,8,2])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, +
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_7__Iteration.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_7__Iteration.ipynb new file mode 100644 index 0000000..82a8cd7 --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_7__Iteration.ipynb @@ -0,0 +1,1198 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 7  Iteration" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* 7.1  Reassignment\n", + "* 7.2  Updating variables\n", + "* 7.3  The while statement\n", + "* 7.4  break\n", + "* 7.5  Square roots\n", + "* 7.6  Algorithms\n", + "* 7.7  Debugging - Demo break the problem in half\n", + "\n", + "\n", + "This chapter is about iteration, which is the ability to run\n", + "a block of statements repeatedly. We saw a kind of iteration,\n", + "using recursion, in Section 5.8.\n", + "We saw another kind, using a for loop,\n", + "in Section 4.2. In this chapter we’ll see yet another\n", + "kind, using a while statement." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.1 Reassignment" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But first I want to say a little more about variable assignment." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you may have discovered, it is legal to make more than one\n", + "assignment to the same variable. A new assignment makes an existing\n", + "variable refer to a new value (and stop referring to the old value)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "5" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = 5\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "7" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = 7\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first time we display \n", + "x, its value is 5; the second time, its\n", + "value is 7." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Figure 7.1 shows what reassignment looks\n", + "like in a state diagram. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point I want to address a common source of\n", + "confusion.\n", + "**Because Python uses the equal sign (=) for assignment, it is\n", + "tempting to interpret a statement like a = b as a\n", + "mathematical\n", + "proposition of equality**; that is, the claim that a and\n", + "b are equal. But this interpretation is wrong.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, equality is a symmetric relationship and assignment is not. For\n", + "example, in mathematics, if `a=7` then `7=a`. But in Python, the\n", + "statement `a = 7` is legal and `7 = a` is not." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also, in mathematics, a proposition of equality is either true or\n", + "false for all time. 
If `a=b` now, then a will always equal b.\n", + "In Python, an assignment statement can make two variables equal, but\n", + "they don’t have to stay that way:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "a = 5" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "b = a # a and b are now equal" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "a = 3 # are a and b equal ?" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "5" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The third line changes the value of a but does not change the\n", + "value of b, so they are no longer equal. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Reassigning variables is often useful, but you should use it\n", + "with caution. If the values of variables change frequently, it can\n", + "make the code difficult to read and debug." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: Python Constant\n", + "\n", + "https://docs.python.org/3/library/typing.html#typing.Final\n", + "\n", + "Python has no true constants. `typing.Final` (Python 3.8+) only asks a static type checker to flag reassignment; nothing is enforced at runtime." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Final  # available in Python 3.8+\n", + "\n", + "MAX_SIZE: Final = 9000\n", + "MAX_SIZE += 1  # Error reported by type checker\n", + "\n", + "class Connection:\n", + "    TIMEOUT: Final[int] = 10\n", + "\n", + "class FastConnector(Connection):\n", + "    TIMEOUT = 1  # Error reported by type checker" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.2 Updating variables" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A common kind of reassignment is an update,\n", + "where the new value of the variable depends on the old." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "x = x + 1" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "8" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This means “get the current value of x, add one, and then\n", + "update x with the new value.”" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you try to update a variable that doesn’t exist, you get an\n", + "error, because Python evaluates the right side before it assigns\n", + "a value to x:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "8" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'x' is 
not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdel\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mNameError\u001b[0m: name 'x' is not defined" + ] + } + ], + "source": [ + "del x\n", + "x = x + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Before you can update a variable, you have to initialize\n", + "it, usually with a simple assignment:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "x = 0\n", + "x = x + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Updating a variable by adding 1 is called an increment;\n", + "subtracting 1 is called a decrement.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: Why are there no ++ and --​ operators in Python?\n", + "\n", + "https://stackoverflow.com/questions/3654830/why-are-there-no-and-operators-in-python\n", + "\n", + "1) Simple increment and decrement aren't needed as much as in other languages. You don't write things like \n", + "`for(int i = 0; i < 10; ++i)` \n", + "in Python very often; instead you do things like \n", + "`for i in range(0, 10)`.\n", + "\n", + "2) Python is a lot about **clarity** and no programmer is likely to correctly guess the meaning of --a unless s/he's learned a language having that construct." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x++\n", + "++x" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x+=1\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x-=1\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.3 The while statement" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Computers are often used to automate repetitive tasks. Repeating\n", + "identical or similar tasks without making errors is something that\n", + "computers do well and people do poorly. In a computer program,\n", + "repetition is also called iteration." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have already seen two functions, countdown and\n", + "print_n, that iterate using recursion. Because iteration is so\n", + "common, Python provides language features to make it easier.\n", + "One is the for statement we saw in Section 4.2.\n", + "We’ll get back to that later." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another is the while statement. 
Here is a version of countdown that uses a while statement:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def countdown(n):\n", + "    while n > 0:\n", + "        print(n)\n", + "        n = n - 1\n", + "    print('Blastoff!')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "You can almost read the while statement as if it were English.\n", + "It means, “While n is greater than 0,\n", + "display the value of n and then decrement\n", + "n. When you get to 0, display the word Blastoff!”\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "More formally, here is the flow of execution for a while statement:\n", + "\n", + "1. Determine whether the condition is true or false.\n", + "2. If false, exit the while statement and continue execution at the next statement.\n", + "3. If the condition is true, run the body and then go back to step 1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This type of flow is called a loop because the third step\n", + "loops back around to the top. \n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The body of the loop should change the value of one or more variables\n", + "so that the condition becomes false eventually and the loop\n", + "terminates. Otherwise the loop will repeat forever, which is called\n", + "an infinite loop. An endless source of amusement for computer\n", + "scientists is the observation that the directions on shampoo,\n", + "“Lather, rinse, repeat”, are an infinite loop.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the case of countdown, we can prove that the loop\n", + "terminates: if n is zero or negative, the loop never runs.\n", + "Otherwise, n gets smaller each time through the\n", + "loop, so eventually we have to get to 0." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For some other loops, it is not so easy to tell. 
For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def sequence(n):\n", + "    while n != 1:\n", + "        print(n)\n", + "        if n % 2 == 0:  # n is even\n", + "            n = n // 2  # integer division keeps n an int\n", + "        else:  # n is odd\n", + "            n = n*3 + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The condition for this loop is n != 1, so the loop will continue\n", + "until n is 1, which makes the condition false." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each time through the loop, the program outputs the value of n\n", + "and then checks whether it is even or odd. If it is even, n is\n", + "divided by 2. If it is odd, the value of n is replaced with\n", + "n*3 + 1. For example, if the argument passed to sequence\n", + "is 3, the resulting values of n are 3, 10, 5, 16, 8, 4, 2, 1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since n sometimes increases and sometimes decreases, there is no\n", + "obvious proof that n will ever reach 1, or that the program\n", + "terminates. For some particular values of n, we can prove\n", + "termination. For example, if the starting value is a power of two,\n", + "n will be even every time through the loop\n", + "until it reaches 1. The previous example ends with such a sequence,\n", + "starting with 16.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The hard question is whether we can prove that this program terminates\n", + "for all positive values of n. So far, no one has\n", + "been able to prove it or disprove it! (See\n", + "http://en.wikipedia.org/wiki/Collatz_conjecture.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an exercise, rewrite the function print_n from\n", + "Section 5.8 using iteration instead of recursion." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: Collatz conjecture sequence" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "def collatz_sequence(x):\n", + "    seq = [x]\n", + "    if x < 1:\n", + "        return []\n", + "    while x > 1:\n", + "        if x % 2 == 0:\n", + "            x = x // 2  # integer division keeps the sequence as ints\n", + "        else:\n", + "            x = 3 * x + 1\n", + "        seq.append(x)\n", + "    return seq" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[5, 16, 8, 4, 2, 1]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "collatz_sequence(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/ec22031bdc2a1ab2e4effe47ae75a836e7dea459)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Resources\n", + "\n", + "* [Project Euler is a series of challenging mathematical/computer programming problems ](https://projecteuler.net/)\n", + "* [Collatz Conjecture in Color - Numberphile](https://www.youtube.com/watch?v=LqKpkdRRLZw)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.4 break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes you don’t know it’s time to end a loop until you get half\n", + "way through the body. In that case you can use the break\n", + "statement to jump out of the loop." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, suppose you want to take input from the user until they\n", + "type done. 
You could write:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "> d\n", + "d\n", + "> wed\n", + "wed\n", + "> done\n", + "Done!\n" + ] + } + ], + "source": [ + "while True:\n", + " line = input('> ')\n", + " if line == 'done':\n", + " break\n", + " print(line)\n", + "\n", + "print('Done!')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The loop condition is True, which is always true, so the\n", + "loop runs until it hits the break statement." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each time through, it prompts the user with an angle bracket.\n", + "If the user types done, the break statement exits\n", + "the loop. Otherwise the program echoes whatever the user types\n", + "and goes back to the top of the loop. Here’s a sample run:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "> not done\n", + "not done\n", + "> done\n", + "Done!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This way of writing while loops is common because you\n", + "can check the condition anywhere in the loop (not just at the\n", + "top) and you can express the stop condition affirmatively\n", + "(“stop when this happens”) rather than negatively (“keep going\n", + "until that happens”)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.5 Square roots" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Loops are often used in programs that compute\n", + "numerical results by starting with an approximate answer and\n", + "iteratively improving it.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For example, one way of computing square roots is Newton’s method.\n", + "Suppose that you want to know the square root of a. 
If you start\n", + "with almost any estimate, x, you can compute a better\n", + "estimate with the following formula:\n", + "\n", + "$$y = \\frac{x + a/x}{2}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "For example, if a is 4 and x is 3:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.1666666666666665" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = 4\n", + "x = 3\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The result is closer to the correct answer (√4 = 2). If we\n", + "repeat the process with the new estimate, it gets even closer:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0064102564102564" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = y\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "After a few more updates, the estimate is almost exact:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0000102400262145" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = y\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0000000000262146" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = y\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In general we don’t know ahead of time how many steps it takes\n", + "to get to the right 
answer, but we know when we get there\n", + "because the estimate\n", + "stops changing:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = y\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2.0" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x = y\n", + "y = (x + a/x) / 2\n", + "y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "When y == x, we can stop. Here is a loop that starts\n", + "with an initial estimate, x, and improves it until it\n", + "stops changing:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "while True:\n", + " print(x)\n", + " y = (x + a/x) / 2\n", + " if y == x:\n", + " break\n", + " x = y" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "For most values of a this works fine, but in general it is\n", + "dangerous to test float equality.\n", + "Floating-point values are only approximately right:\n", + "most rational numbers, like 1/3, and irrational numbers, like\n", + "√2, can’t be represented exactly with a float.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rather than checking whether x and y are exactly equal, it\n", + "is safer to use the built-in function abs to compute the\n", + "absolute value, or magnitude, of the difference between them:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + " if abs(y-x) < epsilon:\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Where 
epsilon has a value like 0.0000001 that\n", + "determines how close is close enough." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.6 Algorithms" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Newton’s method is an example of an algorithm: it is a\n", + "mechanical process for solving a category of problems (in this\n", + "case, computing square roots)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To understand what an algorithm is, it might help to start with\n", + "something that is not an algorithm. When you learned to multiply\n", + "single-digit numbers, you probably memorized the multiplication table.\n", + "In effect, you memorized 100 specific solutions. That kind of\n", + "knowledge is not algorithmic." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But if you were “lazy”, you might have learned a few\n", + "tricks. For example, to find the product of n and 9, you can\n", + "write n−1 as the first digit and 10−n as the second\n", + "digit. This trick is a general solution for multiplying any\n", + "single-digit number by 9. That’s an algorithm!\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, the techniques you learned for addition with carrying,\n", + "subtraction with borrowing, and long division are all algorithms. One\n", + "of the characteristics of algorithms is that they do not require any\n", + "intelligence to carry out. They are mechanical processes where\n", + "each step follows from the last according to a simple set of rules." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Executing algorithms is boring, but designing them is interesting,\n", + "intellectually challenging, and a central part of computer science." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some of the things that people do naturally, without difficulty or\n", + "conscious thought, are the hardest to express algorithmically.\n", + "Understanding natural language is a good example. We all do it, but\n", + "so far no one has been able to explain how we do it, at least\n", + "not in the form of an algorithm." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7.7 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you start writing bigger programs, you might find yourself\n", + "spending more time debugging. More code means more chances to\n", + "make an error and more places for bugs to hide.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way to cut your debugging time is “debugging by bisection”.\n", + "For example, if there are 100 lines in your program and you\n", + "check them one at a time, it would take 100 steps." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Instead, try to break the problem in half. Look at the middle\n", + "of the program, or near it, for an intermediate value you\n", + "can check. Add a print statement (or something else\n", + "that has a verifiable effect) and run the program." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the mid-point check is incorrect, there must be a problem in the\n", + "first half of the program. If it is correct, the problem is\n", + "in the second half." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every time you perform a check like this, you halve the number of\n", + "lines you have to search. After six steps (which is fewer than 100),\n", + "you would be down to one or two lines of code, at least in theory." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In practice it is not always clear what\n", + "the “middle of the program” is and not always possible to\n", + "check it. It doesn’t make sense to count lines and find the\n", + "exact midpoint. Instead, think about places\n", + "in the program where there might be errors and places where it\n", + "is easy to put a check. Then choose a spot where you\n", + "think the chances are about the same that the bug is before\n", + "or after the check." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_8__Strings.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_8__Strings.ipynb new file mode 100644 index 0000000..56f6504 --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_8__Strings.ipynb @@ -0,0 +1,1627 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 8  Strings\n", + "\n", + "* 8.1  A string is a sequence\n", + "* 8.2  len\n", + "* 8.3  Traversal with a for loop\n", + "* 8.4  String slices\n", + "* 8.5  Strings are immutable\n", + "* 8.6  Searching\n", + "* 8.7  Looping and counting\n", + "* 8.8  String methods\n", + "* 8.9  The in operator\n", + "* 8.10  String comparison\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://en.wikipedia.org/wiki/ASCII\n", + "![strings_in_python](strings_in_python.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.1 A string is a sequence" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, 
+ "source": [ + "Strings are not like integers, floats, and booleans. A string\n", + "is a sequence, which means it is\n", + "an ordered collection of other values. In this chapter you’ll see\n", + "how to access the characters that make up a string, and you’ll\n", + "learn about some of the methods strings provide.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "A string is a sequence of characters. \n", + "You can access the characters one at a time with the\n", + "bracket operator:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "fruit = 'banana'\n", + "letter = fruit[1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The second statement selects character number 1 from fruit and assigns it to letter. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The expression in brackets is called an index. \n", + "The index indicates which character in the sequence you\n", + "want (hence the name)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But you might not get what you expect:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'a'" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "letter" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "For most people, the first letter of 'banana' is b, not\n", + "a. But for computer scientists, the index is an offset from the\n", + "beginning of the string, and the offset of the first letter is zero." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'b'" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "letter = fruit[0]\n", + "letter" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "So b is the 0th letter (“zero-eth”) of 'banana', a is the 1th letter (“one-eth”), and n is the 2th letter\n", + "(“two-eth”). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an index you can use an expression that contains variables and\n", + "operators:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'a'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "i = 1\n", + "fruit[i]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'n'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fruit[i+1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But the value of the index has to be an integer. 
Otherwise you\n", + "get:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "ename": "TypeError", + "evalue": "string indices must be integers", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mletter\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfruit\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1.5\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m: string indices must be integers" + ] + } + ], + "source": [ + "letter = fruit[1.5]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.2 len" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "len is a built-in function that returns the number of characters\n", + "in a string:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fruit = 'banana'\n", + "len(fruit)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To get the last letter of a string, you might be tempted to try something\n", + "like this:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "length = len(fruit)\n", + "last = fruit[length]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The reason for the IndexError is that there is no letter in ’banana’ with the index 6. Since we started counting at zero, the\n", + "six letters are numbered 0 to 5. 
To get the last character, you have\n", + "to subtract 1 from length:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "last = fruit[length-1]\n", + "last" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Or you can use negative indices, which count backward from\n", + "the end of the string. The expression fruit[-1] yields the last\n", + "letter, fruit[-2] yields the second to last, and so on.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fruit[-1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.3 Traversal with a for loop" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A lot of computations involve processing a string one character at a\n", + "time. Often they start at the beginning, select each character in\n", + "turn, do something to it, and continue until the end. This pattern of\n", + "processing is called a traversal. One way to write a traversal\n", + "is with a while loop:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "index = 0\n", + "while index < len(fruit):\n", + "    letter = fruit[index]\n", + "    print(letter)\n", + "    index = index + 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This loop traverses the string and displays each letter on a line by\n", + "itself. The loop condition is index < len(fruit), so\n", + "when index is equal to the length of the string, the\n", + "condition is false, and the body of the loop doesn’t run. The\n", + "last character accessed is the one with the index len(fruit)-1,\n", + "which is the last character in the string." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an **exercise**, write a function that takes a string as an argument\n", + "and displays the letters backward, one per line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "index = len(fruit) -1\n", + "while index >= 0:\n", + " letter = fruit[index]\n", + " print(letter)\n", + " index -= 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way to write a traversal is with a for loop:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for letter in fruit:\n", + " print(letter)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Each time through the loop, the next character in the string is assigned\n", + "to the variable letter. The loop continues until no characters are\n", + "left.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following example shows how to use concatenation (string addition)\n", + "and a for loop to generate an abecedarian series (that is, in\n", + "alphabetical order). In Robert McCloskey’s book Make\n", + "Way for Ducklings, the names of the ducklings are Jack, Kack, Lack,\n", + "Mack, Nack, Ouack, Pack, and Quack. 
This loop outputs these names in\n", + "order:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prefixes = 'JKLMNOPQ'\n", + "suffix = 'ack'\n", + "\n", + "for letter in prefixes:\n", + " print(letter + suffix)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The output is:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Jack\n", + "Kack\n", + "Lack\n", + "Mack\n", + "Nack\n", + "Oack\n", + "Pack\n", + "Qack" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Of course, that’s not quite right because “Ouack” and “Quack” are\n", + "misspelled. As an exercise, modify the program to fix this error." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.4 String slices" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A segment of a string is called a slice. Selecting a slice is\n", + "similar to selecting a character:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = 'Monty Python'\n", + "s[0:5]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s[6:12]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The operator [n:m] returns the part of the string from the \n", + "“n-eth” character to the “m-eth” character, including the first but\n", + "excluding the last. This behavior is counterintuitive, but it might\n", + "help to imagine the indices pointing between the\n", + "characters, as in Figure 8.1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you omit the first index (before the colon), the slice starts at\n", + "the beginning of the string. 
If you omit the second index, the slice\n", + "goes to the end of the string:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fruit = 'banana'\n", + "fruit[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fruit[3:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "If the first index is greater than or equal to the second, the result\n", + "is an empty string, represented by two quotation marks:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "fruit = 'banana'\n", + "fruit[3:3]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "An empty string contains no characters and has length 0, but other\n", + "than that, it is the same as any other string." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Continuing this example, what do you think \n", + "fruit[:] means? 
Try it and see.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: Extended Slices\n", + "\n", + "https://docs.python.org/2/whatsnew/2.3.html#extended-slices\n", + "\n", + "[begin:end:step]\n", + "\n", + "* leaving begin and end off\n", + "* specify a step of -1" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ananab'" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fruit[::-1]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'bnn'" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fruit[::2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.5 Strings are immutable" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is tempting to use the [] operator on the left side of an\n", + "assignment, with the intention of changing a character in a string.\n", + "For example:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "ename": "TypeError", + "evalue": "'str' object does not support item assignment", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mgreeting\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'Hello, world!'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mgreeting\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m'J'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m: 'str' object does not support item assignment" + ] + } + ], + "source": [ + "greeting = 'Hello, world!'\n", + "greeting[0] = 'J'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The “object” in this case is the string and the “item” is\n", + "the character you tried to assign. For now, an object is\n", + "the same thing as a value, but we will refine that definition\n", + "later (Section 10.10). \n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The reason for the error is that\n", + "strings are immutable, which means you can’t change an\n", + "existing string. The best you can do is create a new string\n", + "that is a variation on the original:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Jello, world!'" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "greeting = 'Hello, world!'\n", + "new_greeting = 'J' + greeting[1:]\n", + "new_greeting" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'a5'" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'a' + str(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This example concatenates a new first letter onto\n", + "a slice of greeting. 
It has no effect on\n", + "the original string.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.6 Searching" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What does the following function do?\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def find(word, letter):\n", + "    index = 0\n", + "    while index < len(word):\n", + "        if word[index] == letter:\n", + "            return index\n", + "        index = index + 1\n", + "    return -1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In a sense, find is the inverse of the [] operator.\n", + "Instead of taking an index and extracting the corresponding character,\n", + "it takes a character and finds the index where that character\n", + "appears. If the character is not found, the function returns -1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is the first example we have seen of a return statement\n", + "inside a loop. If word[index] == letter, the function breaks\n", + "out of the loop and returns immediately." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the character doesn’t appear in the string, the program\n", + "exits the loop normally and returns -1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This pattern of computation, traversing a sequence and returning\n", + "when we find what we are looking for, is called a search.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As an exercise, modify find so that it has a\n", + "third parameter, the index in word where it should start\n", + "looking." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.7 Looping and counting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following program counts the number of times the letter a\n", + "appears in a string:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3\n" + ] + } + ], + "source": [ + "word = 'banana'\n", + "count = 0\n", + "for letter in word:\n", + "    if letter == 'a':\n", + "        count = count + 1\n", + "print(count)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This program demonstrates another pattern of computation called a counter. The variable count is initialized to 0 and then\n", + "incremented each time an a is found.\n", + "When the loop exits, count\n", + "contains the result: the total number of a’s." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "As an exercise, encapsulate this code in a function named count, and generalize it so that it accepts the string and the\n", + "letter as arguments." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then rewrite the function so that instead of\n", + "traversing the string, it uses the three-parameter version of find from the previous section." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.8 String methods" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Strings provide methods that perform a variety of useful operations.\n", + "A method is similar to a function (it takes arguments and\n", + "returns a value), but the syntax is different. 
For example, the\n", + "method upper takes a string and returns a new string with\n", + "all uppercase letters.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Instead of the function syntax upper(word), it uses\n", + "the method syntax word.upper()." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'BANANA'" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word = 'banana'\n", + "new_word = word.upper()\n", + "new_word" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'banana'" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_word.lower()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This form of dot notation specifies the name of the method, upper, and the name of the string to apply the method to, word. 
The empty parentheses indicate that this method takes no\n", + "arguments.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A method call is called an invocation; in this case, we would\n", + "say that we are invoking upper on word.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As it turns out, there is a string method named find that\n", + "is remarkably similar to the function we wrote:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word = 'banana'\n", + "index = word.find('a')\n", + "index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "In this example, we invoke find on word and pass\n", + "the letter we are looking for as a parameter." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Actually, the find method is more general than our function;\n", + "it can find substrings, not just characters:" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word.find('na')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "By default, find starts at the beginning of the string, but\n", + "it can take a second argument, the index where it should start:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word.find('na', 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This is an example 
of an optional argument;\n", + "find can\n", + "also take a third argument, the index where it should stop:" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "-1" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "name = 'bob'\n", + "name.find('b', 1, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This search fails because b does not\n", + "appear in the index range from 1 to 2, not including 2. Searching up to, but not including, the second index makes\n", + "find consistent with the slice operator." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus\n", + "\n", + "Split\n", + "https://docs.python.org/2/library/string.html#string.split\n", + "\n", + "Built-in Functions\n", + "https://docs.python.org/3/library/functions.html" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Monty Python, Monty Python']" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s = 'Monty Python, Monty Python'\n", + "s.split('$')" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'nbananaobananahbananatbananaybananaPbanana bananaybananatbanananbananaobananaMbanana banana,banananbananaobananahbananatbananaybananaPbanana bananaybananatbanananbananaobananaM'" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fruit.join(reversed(s))" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'a,n,a,n,a,b'" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"','.join(reversed(fruit))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.9 The in operator" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The word in is a boolean operator that takes two strings and\n", + "returns True if the first appears as a substring in the second:" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'a' in 'banana'" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'seed' in 'banana'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "For example, the following function prints all the\n", + "letters from word1 that also appear in word2:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [], + "source": [ + "def in_both(word1, word2):\n", + " for letter in word1:\n", + " if letter in word2:\n", + " print(letter)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "With well-chosen variable names,\n", + "Python sometimes reads like English. 
You could read\n", + "this loop, “for (each) letter in (the first) word, if (the) letter \n", + "(appears) in (the second) word, print (the) letter.”" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here’s what you get if you compare apples and oranges:" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "a\n", + "e\n", + "s\n" + ] + } + ], + "source": [ + "in_both('apples', 'oranges')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.10 String comparison" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The relational operators work on strings. To see if two strings are equal:" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All right, bananas.\n" + ] + } + ], + "source": [ + "if word == 'banana':\n", + " print('All right, bananas.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Other relational operations are useful for putting words in alphabetical\n", + "order:" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All right, bananas.\n" + ] + } + ], + "source": [ + "if word < 'banana':\n", + " print('Your word, ' + word + ', comes before banana.')\n", + "elif word > 'banana':\n", + " print('Your word, ' + word + ', comes after banana.')\n", + "else:\n", + " print('All right, bananas.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Python does not handle uppercase and lowercase letters the same way\n", + "people do. 
All the uppercase letters come before all the\n", + "lowercase letters, so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Your word, Pineapple, comes before banana." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "A common way to address this problem is to convert strings to a\n", + "standard format, such as all lowercase, before performing the\n", + "comparison. Keep that in mind in case you have to defend yourself\n", + "against a man armed with a Pineapple." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.11 Debugging" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you use indices to traverse the values in a sequence,\n", + "it is tricky to get the beginning and end of the traversal\n", + "right. Here is a function that is supposed to compare two\n", + "words and return True if one of the words is the reverse\n", + "of the other, but it contains two errors:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_reverse(word1, word2):\n", + "    if len(word1) != len(word2):\n", + "        return False\n", + "\n", + "    i = 0\n", + "    j = len(word2)\n", + "\n", + "    while j > 0:\n", + "        if word1[i] != word2[j]:\n", + "            return False\n", + "        i = i+1\n", + "        j = j-1\n", + "\n", + "    return True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first if statement checks whether the words are the\n", + "same length. If not, we can return False immediately.\n", + "Otherwise, for the rest of the function, we can assume that the words\n", + "are the same length. This is an example of the guardian pattern\n", + "in Section 6.8.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "i and j are indices: i traverses word1\n", + "forward while j traverses word2 backward. 
If we find\n", + "two letters that don’t match, we can return False immediately.\n", + "If we get through the whole loop and all the letters match, we\n", + "return True." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we test this function with the words “pots” and “stop”, we\n", + "expect the return value True, but we get an IndexError:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_reverse('pots', 'stop')\n", + "...\n", + "  File \"reverse.py\", line 15, in is_reverse\n", + "    if word1[i] != word2[j]:\n", + "IndexError: string index out of range" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "For debugging this kind of error, my first move is to\n", + "print the values of the indices immediately before the line\n", + "where the error appears." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "    while j > 0:\n", + "        print(i, j)        # print here\n", + "\n", + "        if word1[i] != word2[j]:\n", + "            return False\n", + "        i = i+1\n", + "        j = j-1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Now when I run the program again, I get more information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_reverse('pots', 'stop')\n", + "0 4\n", + "...\n", + "IndexError: string index out of range" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first time through the loop, the value of j is 4,\n", + "which is out of range for the string 'stop' (word2, the string that j indexes).\n", + "The index of the last character is 3, so the\n", + "initial value for j should be len(word2)-1." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If I fix that error and run the program again, I get:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "is_reverse('pots', 'stop')\n", + "0 3\n", + "1 2\n", + "2 1\n", + "True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This time we get the right answer, but it looks like the loop only ran\n", + "three times, which is suspicious. To get a better idea of what is\n", + "happening, it is useful to draw a state diagram. During the first\n", + "iteration, the frame for is_reverse is shown in\n", + "Figure 8.2. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I took some license by arranging the variables in the frame\n", + "and adding dotted lines to show that the values of i and\n", + "j indicate characters in word1 and word2." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Starting with this diagram, run the program on paper, changing the\n", + "values of i and j during each iteration. 
Find and fix the\n", + "second error in this function.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8.12 Glossary" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Books/Think Python/Think_Python_Chapter_9__Case_study_A_word_play.ipynb b/notebooks/Books/Think Python/Think_Python_Chapter_9__Case_study_A_word_play.ipynb new file mode 100644 index 0000000..6eac903 --- /dev/null +++ b/notebooks/Books/Think Python/Think_Python_Chapter_9__Case_study_A_word_play.ipynb @@ -0,0 +1,1405 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Chapter 9  Case study: word play" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.1 Reading word lists" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This chapter presents the second case study, which involves\n", + "solving word puzzles by searching for words that have certain\n", + "properties. For example, we’ll find the longest palindromes\n", + "in English and search for words whose letters appear in\n", + "alphabetical order. And I will present another program development\n", + "plan: reduction to a previously solved problem." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the exercises in this chapter we need a list of English words.\n", + "There are lots of word lists available on the Web, but the one most\n", + "suitable for our purpose is one of the word lists collected and\n", + "contributed to the public domain by Grady Ward as part of the Moby\n", + "lexicon project (see http://wikipedia.org/wiki/Moby_Project). It\n", + "is a list of 113,809 official crosswords; that is, words that are\n", + "considered valid in crossword puzzles and other word games. In the\n", + "Moby collection, the filename is 113809of.fic; you can download\n", + "a copy, with the simpler name words.txt, from\n", + "http://thinkpython2.com/code/words.txt.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This file is in plain text, so you can open it with a text\n", + "editor, but you can also read it from Python. The built-in\n", + "function open takes the name of the file as a parameter\n", + "and returns a file object you can use to read the file.\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "fin = open('words.txt')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<_io.TextIOWrapper name='words.txt' mode='r' encoding='UTF-8'>" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fin" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "fin is a common name for a file object used for input. 
The file\n", + "object provides several methods for reading, including readline,\n", + "which reads characters from the file until it gets to a newline and\n", + "returns the result as a string: \n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'aa\\n'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fin.readline()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The first word in this particular list is “aa”, which is a kind of\n", + "lava. The sequence \\n represents the newline character that \n", + "separates this word from the next." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The file object keeps track of where it is in the file, so\n", + "if you call readline again, you get the next word:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'aah\\n'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "fin.readline()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The next word is “aah”, which is a perfectly legitimate\n", + "word, so stop looking at me like that.\n", + "Or, if it’s the newline character that’s bothering you,\n", + "we can get rid of it with the string method strip:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'aahed'" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "line = fin.readline()\n", + "word = line.strip()\n", + "word" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "You can also use a file object as part of a for loop.\n", + "This program reads words.txt and prints 
each word, one\n", + "per line:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "zymurgy\n" + ] + } + ], + "source": [ + "fin = open('words.txt')\n", + "for line in fin:\n", + " word = line.strip()\n", + " #print(word)\n", + "print(word)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.2 Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are solutions to these exercises in the next section.\n", + "You should at least attempt each one before you read the solutions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 1** Write a program that reads words.txt and prints only the words with more than 20 characters (not counting whitespace)." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
word
0aa
1aah
2aahed
3aahing
4aahs
\n", + "
" + ], + "text/plain": [ + " word\n", + "0 aa\n", + "1 aah\n", + "2 aahed\n", + "3 aahing\n", + "4 aahs" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "words = pd.read_csv('words.txt', names=['word'])\n", + "\n", + "words.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(113809, 1)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
word
21685counterdemonstrations
47408hyperaggressivenesses
60406microminiaturizations
\n", + "
" + ], + "text/plain": [ + " word\n", + "21685 counterdemonstrations\n", + "47408 hyperaggressivenesses\n", + "60406 microminiaturizations" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words[words['word'].str.len() > 20]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 2** \n", + "In 1939 Ernest Vincent Wright published a 50,000 word novel called Gadsby that does not contain the letter “e”. Since “e” is the most common letter in English, that’s not easy to do.\n", + "\n", + "In fact, it is difficult to construct a solitary thought without using that most common symbol. It is slow going at first, but with caution and hours of training you can gradually gain facility.\n", + "\n", + "All right, I’ll stop now.\n", + "\n", + "Write a function called has_no_e that returns True if the given word doesn’t have the letter “e” in it.\n", + "\n", + "Write a program that reads words.txt and prints only the words that have no “e”. Compute the percentage of words in the list that have no “e”." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(37641, 1)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words[~words['word'].fillna('_').str.contains('e')].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(76168, 1)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words[words['word'].fillna('_').str.contains('e')].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
word
113800zymogenes
113801zymogens
113802zymologies
113804zymoses
113807zymurgies
\n", + "
" + ], + "text/plain": [ + " word\n", + "113800 zymogenes\n", + "113801 zymogens\n", + "113802 zymologies\n", + "113804 zymoses\n", + "113807 zymurgies" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "words[words['word'].fillna('_').str.contains('e')].tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 3**\n", + "Write a function named avoids that takes a word and a string of forbidden letters, and that returns True if the word doesn’t use any of the forbidden letters.\n", + "\n", + "Write a program that prompts the user to enter a string of forbidden letters and then prints the number of words that don’t contain any of them. Can you find a combination of 5 forbidden letters that excludes the smallest number of words?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 4** \n", + "Write a function named uses_only that takes a word and a string of letters, and that returns True if the word contains only letters in the list. Can you make a sentence using only the letters acefhlo? Other than “Hoe alfalfa”?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 5** \n", + "Write a function named uses_all that takes a word and a string of required letters, and that returns True if the word uses all the required letters at least once. How many words are there that use all the vowels aeiou? How about aeiouy?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise 6**\n", + "Write a function called is_abecedarian that returns True if the letters in a word appear in alphabetical order (double letters are ok). How many abecedarian words are there?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.3 Search" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Pixiedust database opened successfully\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " \n", + " \n", + " Pixiedust version 1.1.18\n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pixiedust" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All of the exercises in the previous section have something\n", + "in common; they can be solved with the search pattern we saw\n", + "in Section 8.6. The simplest example is:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "def has_no_e(word):\n", + " for letter in word:\n", + " if letter == 'e':\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "has_no_e('letter')" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "has_no_e('xxxx')" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "pixiedust": { + "displayParams": {} + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter
" + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "pixieapp_metadata": null + }, + "output_type": "display_data" + } + ], + "source": [ + "%%pixie_debugger\n", + "has_no_e('letter')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The for loop traverses the characters in word. If we find\n", + "the letter “e”, we can immediately return False; otherwise we\n", + "have to go to the next letter. If we exit the loop normally, that\n", + "means we didn’t find an “e”, so we return True.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "You could write this function more concisely using the in\n", + "operator, but I started with this version because it \n", + "demonstrates the logic of the search pattern." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "avoids is a more general version of has_no_e but it\n", + "has the same structure:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "def avoids(word, forbidden):\n", + " for letter in word:\n", + " if letter in forbidden:\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "avoids('hintw', 'wz')" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "pixiedust": { + "displayParams": {} + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter
" + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "pixieapp_metadata": null + }, + "output_type": "display_data" + } + ], + "source": [ + "%%pixie_debugger\n", + "avoids('hint', 'wz')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can return False as soon as we find a forbidden letter;\n", + "if we get to the end of the loop, we return True." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "uses_only is similar except that the sense of the condition\n", + "is reversed:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "def uses_only(word, available):\n", + " for letter in word: \n", + " if letter not in available:\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "uses_only('hinth', 'inth')" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "pixiedust": { + "displayParams": {} + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter
" + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "pixieapp_metadata": null + }, + "output_type": "display_data" + } + ], + "source": [ + "%%pixie_debugger\n", + "uses_only('hint', 'inh')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Instead of a list of forbidden letters, we have a list of available\n", + "letters. If we find a letter in word that is not in\n", + "available, we can return False." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "uses_all is similar except that we reverse the role\n", + "of the word and the string of letters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def uses_all(word, required):\n", + " for letter in required: \n", + " if letter not in word:\n", + " return False\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "uses_only('hinth', 'inth')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "pixiedust": { + "displayParams": {} + } + }, + "outputs": [], + "source": [ + "%%pixie_debugger\n", + "uses_only('hintt', 'inth')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Instead of traversing the letters in word, the loop\n", + "traverses the required letters. 
If any of the required letters\n", + "do not appear in the word, we can return False.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you were really thinking like a computer scientist, you would\n", + "have recognized that uses_all was an instance of a\n", + "previously solved problem, and you would have written:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def uses_all(word, required):\n", + " return uses_only(required, word)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This is an example of a program development plan called reduction to a previously solved problem, which means that you\n", + "recognize the problem you are working on as an instance of a solved\n", + "problem and apply an existing solution. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus\n", + "\n", + "How to check performance in Jupyter Notebook" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The slowest run took 9.60 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", + "10000000 loops, best of 3: 143 ns per loop\n" + ] + } + ], + "source": [ + "%%timeit \n", + "a = \"abc\"\n", + "b = \"abcdefghijklmnopqrstuvwxyz\"\n", + "for i in a:\n", + " if i in b: \n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1000000 loops, best of 3: 678 ns per loop\n" + ] + } + ], + "source": [ + "%%timeit \n", + "b = \"abc\"\n", + "a = \"abcdefghijklmnopqrstuvwxyz\"\n", + "for i in a:\n", + " if i in b: \n", + " pass" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9.4 Looping with indices" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I wrote the functions in the previous section with for\n", + "loops because I only needed the characters in the strings; I didn’t\n", + "have to do anything with the indices." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For is_abecedarian we have to compare adjacent letters,\n", + "which is a little tricky with a for loop:" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "def is_abecedarian(word):\n", + " previous = word[0]\n", + " for c in word:\n", + " if c < previous:\n", + " return False\n", + " previous = c\n", + " return True" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_abecedarian('hintt')" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "pixiedust": { + "displayParams": {} + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter
" + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "pixieapp_metadata": null + }, + "output_type": "display_data" + } + ], + "source": [ + "%%pixie_debugger\n", + "is_abecedarian('hintt')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An alternative is to use recursion:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_abecedarian(word):\n", + " if len(word) <= 1:\n", + " return True\n", + " if word[0] > word[1]:\n", + " return False\n", + " return is_abecedarian(word[1:])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another option is to use a while loop:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_abecedarian(word):\n", + " i = 0\n", + " while i < len(word)-1:\n", + " if word[i+1] < word[i]:\n", + " return False\n", + " i = i+1\n", + " return True" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "The loop starts at i=0 and ends when i=len(word)-1. Each\n", + "time through the loop, it compares the ith character (which you can\n", + "think of as the current character) to the i+1th character (which you\n", + "can think of as the next)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the next character is less than (alphabetically before) the current\n", + "one, then we have discovered a break in the abecedarian trend, and\n", + "we return False." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we get to the end of the loop without finding a fault, then the\n", + "word passes the test. To convince yourself that the loop ends\n", + "correctly, consider an example like 'flossy'. The\n", + "length of the word is 6, so\n", + "the last time the loop runs is when i is 4, which is the\n", + "index of the second-to-last character. 
On the last iteration,\n", + "it compares the second-to-last character to the last, which is\n", + "what we want.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is a version of is_palindrome (see\n", + "Exercise 3) that uses two indices; one starts at the\n", + "beginning and goes up; the other starts at the end and goes down." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def is_palindrome(word):\n", + " i = 0\n", + " j = len(word)-1\n", + "\n", + " while i\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_id
031114321
120103102
230004023
310002010
440235321
\n", + "" + ], + "text/plain": [ + " parent_id child_id\n", + "0 3111 4321\n", + "1 2010 3102\n", + "2 3000 4023\n", + "3 1000 2010\n", + "4 4023 5321" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "df = pd.DataFrame(\n", + " {\n", + " 'parent_id': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102, 2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033],\n", + " 'child_id': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001, 3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311]\n", + " }\n", + ")\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "parent_id int64\n", + "child_id int64\n", + "dtype: object" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
change type of a column\n", + "* int to str\n", + "* str to int" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "df.parent_id = df.parent_id.astype('str')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "parent_id object\n", + "child_id int64\n", + "dtype: object" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df.parent_id = df.parent_id.astype('int')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "parent_id int64\n", + "child_id int64\n", + "dtype: object" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_id
count20.00000020.000000
mean2885.8500003971.400000
std1263.5017871327.918014
min1000.0000002010.000000
25%2077.5000003027.500000
50%3011.0000004016.500000
75%3339.0000004493.250000
max5010.0000006525.000000
\n", + "
" + ], + "text/plain": [ + " parent_id child_id\n", + "count 20.000000 20.000000\n", + "mean 2885.850000 3971.400000\n", + "std 1263.501787 1327.918014\n", + "min 1000.000000 2010.000000\n", + "25% 2077.500000 3027.500000\n", + "50% 3011.000000 4016.500000\n", + "75% 3339.000000 4493.250000\n", + "max 5010.000000 6525.000000" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. convert column to a category\n", + "\n", + "Two reasons for that\n", + "* performance - having small number of distinct values (lots of repetition in single column)\n", + "* sort - when the lexical order of a variable is not the same as the logical order " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "df.parent_id = df.parent_id.astype('category')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "parent_id category\n", + "child_id int64\n", + "dtype: object" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
child_id
count20.000000
mean3971.400000
std1327.918014
min2010.000000
25%3027.500000
50%4016.500000
75%4493.250000
max6525.000000
\n", + "
" + ], + "text/plain": [ + " child_id\n", + "count 20.000000\n", + "mean 3971.400000\n", + "std 1327.918014\n", + "min 2010.000000\n", + "25% 3027.500000\n", + "50% 4016.500000\n", + "75% 4493.250000\n", + "max 6525.000000" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "df.child_id = df.child_id.astype('category')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Int64Index([2010, 2100, 2110, 3000, 3011, 3033, 3102, 3111, 4001, 4010, 4023,\n", + " 4113, 4200, 4311, 4321, 5010, 5321, 6016, 6100, 6525],\n", + " dtype='int64')" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['child_id'].cat.categories" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Int64Index([1000, 2010, 2100, 2110, 3000, 3011, 3033, 3102, 3111, 4023, 5010], dtype='int64')" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['parent_id'].cat.categories" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for column in ['parent_id', 'child_id']:\n", + " df[col] = df[column].astype('int')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
create new column by n characters from another column\n", + "\n", + "* get last n characters\n", + "* get first n characters\n", + "* get n characters from the middle" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "df['parent_id_last'] = df.apply(lambda row: str(row['parent_id'])[-3:], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_last
031114321111
120103102010
230004023000
310002010000
440235321023
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last\n", + "0 3111 4321 111\n", + "1 2010 3102 010\n", + "2 3000 4023 000\n", + "3 1000 2010 000\n", + "4 4023 5321 023" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "df['parent_id_first'] = df.apply(lambda row: str(row['parent_id'])[:2], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_lastparent_id_first
03111432111131
12010310201020
23000402300030
31000201000010
44023532102340
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last parent_id_first\n", + "0 3111 4321 111 31\n", + "1 2010 3102 010 20\n", + "2 3000 4023 000 30\n", + "3 1000 2010 000 10\n", + "4 4023 5321 023 40" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "df['parent_id_middle'] = df.apply(lambda row: str(row['child_id'])[1:3], axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_lastparent_id_firstparent_id_middle
0311143211113132
1201031020102010
2300040230003002
3100020100001001
4402353210234032
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last parent_id_first parent_id_middle\n", + "0 3111 4321 111 31 32\n", + "1 2010 3102 010 20 10\n", + "2 3000 4023 000 30 02\n", + "3 1000 2010 000 10 01\n", + "4 4023 5321 023 40 32" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. combine two columns into another column" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "df['combined'] = df.apply(lambda row: str(row['parent_id']) + str(row['child_id']), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_lastparent_id_firstparent_id_middlecombined
031114321111313231114321
120103102010201020103102
230004023000300230004023
310002010000100110002010
440235321023403240235321
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last parent_id_first parent_id_middle combined\n", + "0 3111 4321 111 31 32 31114321\n", + "1 2010 3102 010 20 10 20103102\n", + "2 3000 4023 000 30 02 30004023\n", + "3 1000 2010 000 10 01 10002010\n", + "4 4023 5321 023 40 32 40235321" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "df['combined'] = df.apply(lambda row: str(row['parent_id'] + row['child_id']), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_lastparent_id_firstparent_id_middlecombined
03111432111131327432
12010310201020105112
23000402300030027023
31000201000010013010
44023532102340329344
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last parent_id_first parent_id_middle combined\n", + "0 3111 4321 111 31 32 7432\n", + "1 2010 3102 010 20 10 5112\n", + "2 3000 4023 000 30 02 7023\n", + "3 1000 2010 000 10 01 3010\n", + "4 4023 5321 023 40 32 9344" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "def comb(x, y):\n", + " return str(x) + str(y)\n", + "\n", + "df['comb'] = df.apply(lambda row: comb(row['parent_id'], row['child_id']), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_lastparent_id_firstparent_id_middlecombinedcomb
0311143211113132743231114321
1201031020102010511220103102
2300040230003002702330004023
3100020100001001301010002010
4402353210234032934440235321
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_last parent_id_first parent_id_middle combined \\\n", + "0 3111 4321 111 31 32 7432 \n", + "1 2010 3102 010 20 10 5112 \n", + "2 3000 4023 000 30 02 7023 \n", + "3 1000 2010 000 10 01 3010 \n", + "4 4023 5321 023 40 32 9344 \n", + "\n", + " comb \n", + "0 31114321 \n", + "1 20103102 \n", + "2 30004023 \n", + "3 10002010 \n", + "4 40235321 " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Dataframe_to_json_nested.ipynb b/notebooks/Dataframe_to_json_nested.ipynb new file mode 100644 index 0000000..2f1dee5 --- /dev/null +++ b/notebooks/Dataframe_to_json_nested.ipynb @@ -0,0 +1,1117 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## pandas DataFrame generate n-level hierarchical JSON\n", + "\n", + "* hierarchical data\n", + "* mapping pandas columns\n", + "* Pretty print json and dataframe split\n", + "* generate n-level hierarchical JSON" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_id
031114321
120103102
230004023
310002010
440235321
\n", + "
" + ], + "text/plain": [ + " parent_id child_id\n", + "0 3111 4321\n", + "1 2010 3102\n", + "2 3000 4023\n", + "3 1000 2010\n", + "4 4023 5321" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import json\n", + "df = pd.DataFrame(\n", + " {\n", + " 'parent_id': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102, 2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033],\n", + " 'child_id': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001, 3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311]\n", + " }\n", + ")\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "lst = json.loads(df.to_json(orient='split'))['data']\n", + "\n", + "# Build a directed graph and a list of all names that have no parent\n", + "graph = {name: set() for tup in lst for name in tup}\n", + "has_parent = {name: False for tup in lst for name in tup}\n", + "for parent, child in lst:\n", + " graph[parent].add(child)\n", + " has_parent[child] = True\n", + "\n", + "# All names that have absolutely no parent:\n", + "roots = [name for name, parents in has_parent.items() if not parents]\n", + "\n", + "# traversal of the graph (doesn't care about duplicates and cycles)\n", + "def traverse(hierarchy, graph, names):\n", + " for name in names:\n", + " hierarchy[name] = traverse({}, graph, graph[name])\n", + " return hierarchy\n", + "\n", + "result = traverse({}, graph, roots)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"1000\": {\n", + " \"2010\": {\n", + " \"3011\": {\n", + " \"4200\": {},\n", + " \"4010\": {}\n", + " },\n", + " \"3102\": {\n", + " \"4001\": {}\n", + " }\n", + " },\n", + " \"2100\": {\n", + " \"3033\": {\n", + " \"4113\": {},\n", + " \"4311\": {}\n", + " }\n", + " 
},\n", + " \"2110\": {\n", + " \"3000\": {\n", + " \"4023\": {\n", + " \"5321\": {},\n", + " \"5010\": {\n", + " \"6016\": {},\n", + " \"6100\": {},\n", + " \"6525\": {}\n", + " }\n", + " }\n", + " },\n", + " \"3111\": {\n", + " \"4321\": {}\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "import json\n", + "print(json.dumps(result, indent=2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Column Mapping" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "df.parent_id = df.parent_id.astype('category')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df.child_id = df.child_id.astype('category')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Int64Index([2010, 2100, 2110, 3000, 3011, 3033, 3102, 3111, 4001, 4010, 4023,\n", + " 4113, 4200, 4311, 4321, 5010, 5321, 6016, 6100, 6525],\n", + " dtype='int64')" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['child_id'].cat.categories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Int64Index([1000, 2010, 2100, 2110, 3000, 3011, 3033, 3102, 3111, 4023, 5010], dtype='int64')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['parent_id'].cat.categories" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "df['parent_id_new'] = df.parent_id.map({1000:\"A\",\t2010:\"B\",\t2100:\"C\",\t2110:\"D\",\t3000:\"E\",\t3011:\"F\",\t3033:\"G\",\t3102:\"H\",\t3111:\"I\",\t4023:\"K\",\t5010:\"L\"\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ 
+ "df['child_id_new'] = df.child_id.map({1000:\"A\",\t2010:\"B\",\t2100:\"C\",\t2110:\"D\",\t3000:\"E\",\t3011:\"F\",\t3033:\"G\",\t3102:\"H\",\t3111:\"I\",\t4023:\"K\",\t5010:\"L\",\t4001:\"M\",\t4010:\"N\",\t4113:\"O\",\t4200:\"P\",\t4311:\"Q\",\t4321:\"R\",\t6016:\"S\",\t6525:\"T\",\t6100:\"U\",\t5321:\"V\"\n", + "\n", + "})" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_idparent_id_newchild_id_new
031114321IR
120103102BH
230004023EK
310002010AB
440235321KV
530114200FP
630334113GO
750106525LT
830114010FN
931024001HM
1020103011BF
1140235010KL
1221103000DE
1321003033CG
1410002110AD
1550106100LU
1621103111DI
1710002100AC
1850106016LS
1930334311GQ
\n", + "
" + ], + "text/plain": [ + " parent_id child_id parent_id_new child_id_new\n", + "0 3111 4321 I R\n", + "1 2010 3102 B H\n", + "2 3000 4023 E K\n", + "3 1000 2010 A B\n", + "4 4023 5321 K V\n", + "5 3011 4200 F P\n", + "6 3033 4113 G O\n", + "7 5010 6525 L T\n", + "8 3011 4010 F N\n", + "9 3102 4001 H M\n", + "10 2010 3011 B F\n", + "11 4023 5010 K L\n", + "12 2110 3000 D E\n", + "13 2100 3033 C G\n", + "14 1000 2110 A D\n", + "15 5010 6100 L U\n", + "16 2110 3111 D I\n", + "17 1000 2100 A C\n", + "18 5010 6016 L S\n", + "19 3033 4311 G Q" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pretty print json and dataframe split" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"index\": [\n", + " 0,\n", + " 1,\n", + " 2,\n", + " 3,\n", + " 4,\n", + " 5,\n", + " 6,\n", + " 7,\n", + " 8,\n", + " 9,\n", + " 10,\n", + " 11,\n", + " 12,\n", + " 13,\n", + " 14,\n", + " 15,\n", + " 16,\n", + " 17,\n", + " 18,\n", + " 19\n", + " ],\n", + " \"columns\": [\n", + " \"parent_id\",\n", + " \"child_id\",\n", + " \"parent_id_new\",\n", + " \"child_id_new\"\n", + " ],\n", + " \"data\": [\n", + " [\n", + " 3111,\n", + " 4321,\n", + " \"I\",\n", + " \"R\"\n", + " ],\n", + " [\n", + " 2010,\n", + " 3102,\n", + " \"B\",\n", + " \"H\"\n", + " ],\n", + " [\n", + " 3000,\n", + " 4023,\n", + " \"E\",\n", + " \"K\"\n", + " ],\n", + " [\n", + " 1000,\n", + " 2010,\n", + " \"A\",\n", + " \"B\"\n", + " ],\n", + " [\n", + " 4023,\n", + " 5321,\n", + " \"K\",\n", + " \"V\"\n", + " ],\n", + " [\n", + " 3011,\n", + " 4200,\n", + " \"F\",\n", + " \"P\"\n", + " ],\n", + " [\n", + " 3033,\n", + " 4113,\n", + " \"G\",\n", + " \"O\"\n", + " ],\n", + " [\n", + " 5010,\n", + " 6525,\n", + " \"L\",\n", + " \"T\"\n", + " ],\n", + " [\n", + " 
3011,\n", + " 4010,\n", + " \"F\",\n", + " \"N\"\n", + " ],\n", + " [\n", + " 3102,\n", + " 4001,\n", + " \"H\",\n", + " \"M\"\n", + " ],\n", + " [\n", + " 2010,\n", + " 3011,\n", + " \"B\",\n", + " \"F\"\n", + " ],\n", + " [\n", + " 4023,\n", + " 5010,\n", + " \"K\",\n", + " \"L\"\n", + " ],\n", + " [\n", + " 2110,\n", + " 3000,\n", + " \"D\",\n", + " \"E\"\n", + " ],\n", + " [\n", + " 2100,\n", + " 3033,\n", + " \"C\",\n", + " \"G\"\n", + " ],\n", + " [\n", + " 1000,\n", + " 2110,\n", + " \"A\",\n", + " \"D\"\n", + " ],\n", + " [\n", + " 5010,\n", + " 6100,\n", + " \"L\",\n", + " \"U\"\n", + " ],\n", + " [\n", + " 2110,\n", + " 3111,\n", + " \"D\",\n", + " \"I\"\n", + " ],\n", + " [\n", + " 1000,\n", + " 2100,\n", + " \"A\",\n", + " \"C\"\n", + " ],\n", + " [\n", + " 5010,\n", + " 6016,\n", + " \"L\",\n", + " \"S\"\n", + " ],\n", + " [\n", + " 3033,\n", + " 4311,\n", + " \"G\",\n", + " \"Q\"\n", + " ]\n", + " ]\n", + "}\n" + ] + } + ], + "source": [ + "res = df.to_dict(orient='split')\n", + "import json\n", + "print(json.dumps(res, indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"parent_id\": {\n", + " \"0\": 3111,\n", + " \"1\": 2010,\n", + " \"2\": 3000,\n", + " \"3\": 1000,\n", + " \"4\": 4023,\n", + " \"5\": 3011,\n", + " \"6\": 3033,\n", + " \"7\": 5010,\n", + " \"8\": 3011,\n", + " \"9\": 3102,\n", + " \"10\": 2010,\n", + " \"11\": 4023,\n", + " \"12\": 2110,\n", + " \"13\": 2100,\n", + " \"14\": 1000,\n", + " \"15\": 5010,\n", + " \"16\": 2110,\n", + " \"17\": 1000,\n", + " \"18\": 5010,\n", + " \"19\": 3033\n", + " },\n", + " \"child_id\": {\n", + " \"0\": 4321,\n", + " \"1\": 3102,\n", + " \"2\": 4023,\n", + " \"3\": 2010,\n", + " \"4\": 5321,\n", + " \"5\": 4200,\n", + " \"6\": 4113,\n", + " \"7\": 6525,\n", + " \"8\": 4010,\n", + " \"9\": 4001,\n", + " \"10\": 3011,\n", + " \"11\": 5010,\n", + " 
\"12\": 3000,\n", + " \"13\": 3033,\n", + " \"14\": 2110,\n", + " \"15\": 6100,\n", + " \"16\": 3111,\n", + " \"17\": 2100,\n", + " \"18\": 6016,\n", + " \"19\": 4311\n", + " },\n", + " \"parent_id_new\": {\n", + " \"0\": \"I\",\n", + " \"1\": \"B\",\n", + " \"2\": \"E\",\n", + " \"3\": \"A\",\n", + " \"4\": \"K\",\n", + " \"5\": \"F\",\n", + " \"6\": \"G\",\n", + " \"7\": \"L\",\n", + " \"8\": \"F\",\n", + " \"9\": \"H\",\n", + " \"10\": \"B\",\n", + " \"11\": \"K\",\n", + " \"12\": \"D\",\n", + " \"13\": \"C\",\n", + " \"14\": \"A\",\n", + " \"15\": \"L\",\n", + " \"16\": \"D\",\n", + " \"17\": \"A\",\n", + " \"18\": \"L\",\n", + " \"19\": \"G\"\n", + " },\n", + " \"child_id_new\": {\n", + " \"0\": \"R\",\n", + " \"1\": \"H\",\n", + " \"2\": \"K\",\n", + " \"3\": \"B\",\n", + " \"4\": \"V\",\n", + " \"5\": \"P\",\n", + " \"6\": \"O\",\n", + " \"7\": \"T\",\n", + " \"8\": \"N\",\n", + " \"9\": \"M\",\n", + " \"10\": \"F\",\n", + " \"11\": \"L\",\n", + " \"12\": \"E\",\n", + " \"13\": \"G\",\n", + " \"14\": \"D\",\n", + " \"15\": \"U\",\n", + " \"16\": \"I\",\n", + " \"17\": \"C\",\n", + " \"18\": \"S\",\n", + " \"19\": \"Q\"\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "res = df.to_dict()\n", + "import json\n", + "print(json.dumps(res, indent=2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Traverse a graph" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "lst = [('Linux','Debian'), ('Linux','Red Hat'), ('Debian','Ubuntu'), ('Debian','Knoppix'), \n", + " ('Ubuntu','Linux Mint'), ('Red Hat','CentOS'), ('Red Hat','Mandrake')]\n", + "\n", + "# Build a directed graph and a list of all names that have no parent\n", + "graph = {name: set() for tup in lst for name in tup}\n", + "has_parent = {name: False for tup in lst for name in tup}\n", + "for parent, child in lst:\n", + " graph[parent].add(child)\n", + " has_parent[child] = True\n", + "\n", + "# All names that have 
absolutely no parent:\n", + "roots = [name for name, parents in has_parent.items() if not parents]\n", + "\n", + "# traversal of the graph (doesn't care about duplicates and cycles)\n", + "def traverse(hierarchy, graph, names):\n", + " for name in names:\n", + " hierarchy[name] = traverse({}, graph, graph[name])\n", + " return hierarchy\n", + "\n", + "nested_json = traverse({}, graph, roots)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"Linux\": {\n", + " \"Debian\": {\n", + " \"Ubuntu\": {\n", + " \"Linux Mint\": {}\n", + " },\n", + " \"Knoppix\": {}\n", + " },\n", + " \"Red Hat\": {\n", + " \"Mandrake\": {},\n", + " \"CentOS\": {}\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "import json\n", + "print(json.dumps(nested_json, indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Linux': {'Debian', 'Red Hat'},\n", + " 'Debian': {'Knoppix', 'Ubuntu'},\n", + " 'Red Hat': {'CentOS', 'Mandrake'},\n", + " 'Ubuntu': {'Linux Mint'},\n", + " 'Knoppix': set(),\n", + " 'Linux Mint': set(),\n", + " 'CentOS': set(),\n", + " 'Mandrake': set()}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Build a directed graph and a list of all names that have no parent\n", + "\n", + "graph = {name: set() for tup in lst for name in tup}\n", + "has_parent = {name: False for tup in lst for name in tup}\n", + "for parent, child in lst:\n", + " graph[parent].add(child)\n", + " has_parent[child] = True\n", + "\n", + "graph " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Linux': False,\n", + " 'Debian': True,\n", + " 'Red Hat': True,\n", + " 'Ubuntu': True,\n", + " 'Knoppix': True,\n", + " 'Linux Mint': True,\n", + " 
'CentOS': True,\n", + " 'Mandrake': True}" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "has_parent" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Linux']" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# All names that have absolutely no parent:\n", + "roots = [name for name, parents in has_parent.items() if not parents]\n", + "roots" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
parent_idchild_id
031114321
120103102
230004023
310002010
440235321
\n", + "
" + ], + "text/plain": [ + " parent_id child_id\n", + "0 3111 4321\n", + "1 2010 3102\n", + "2 3000 4023\n", + "3 1000 2010\n", + "4 4023 5321" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import json\n", + "\n", + "df = pd.DataFrame(\n", + " {\n", + " 'parent_id': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102, 2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033],\n", + " 'child_id': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001, 3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311]\n", + " }\n", + ")\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "lst = json.loads(df.to_json(orient='split'))['data']\n", + "\n", + "# Build a directed graph and a list of all names that have no parent\n", + "graph = {name: set() for tup in lst for name in tup}\n", + "has_parent = {name: False for tup in lst for name in tup}\n", + "for parent, child in lst:\n", + " graph[parent].add(child)\n", + " has_parent[child] = True\n", + "\n", + "# All names that have absolutely no parent:\n", + "roots = [name for name, parents in has_parent.items() if not parents]\n", + "\n", + "# traversal of the graph (doesn't care about duplicates and cycles)\n", + "def traverse(hierarchy, graph, names):\n", + " for name in names:\n", + " hierarchy[name] = traverse({}, graph, graph[name])\n", + " return hierarchy\n", + "\n", + "result = traverse({}, graph, roots)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"1000\": {\n", + " \"2010\": {\n", + " \"3011\": {\n", + " \"4200\": {},\n", + " \"4010\": {}\n", + " },\n", + " \"3102\": {\n", + " \"4001\": {}\n", + " }\n", + " },\n", + " \"2100\": {\n", + " \"3033\": {\n", + " \"4113\": {},\n", + " \"4311\": {}\n", + " 
}\n", + " },\n", + " \"2110\": {\n", + " \"3000\": {\n", + " \"4023\": {\n", + " \"5321\": {},\n", + " \"5010\": {\n", + " \"6016\": {},\n", + " \"6100\": {},\n", + " \"6525\": {}\n", + " }\n", + " }\n", + " },\n", + " \"3111\": {\n", + " \"4321\": {}\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "import json\n", + "print(json.dumps(result, indent=2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/How_to_extract_information_from_excel_with_Python_and_Pandas.ipynb b/notebooks/How_to_extract_information_from_excel_with_Python_and_Pandas.ipynb new file mode 100644 index 0000000..49500d1 --- /dev/null +++ b/notebooks/How_to_extract_information_from_excel_with_Python_and_Pandas.ipynb @@ -0,0 +1,702 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## How to extract information from excel with Python and Pandas\n", + "\n", + "source\n", + "\n", + "http://blog.softhints.com/excel-export-results-read-excel-python-pandas/\n", + "\n", + "requirements:\n", + "\n", + "```\n", + "pip install xlrd\n", + "pip install pandas\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + 
"pd.set_option('display.max_columns', None) # or 1000\n", + "pd.set_option('display.max_rows', None) # or 1000\n", + "pd.set_option('display.max_colwidth', -1) # or 199" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Read excel file with python/pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# read the file\n", + "xls = pd.ExcelFile('~/Documents/example.xlsx')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['People', 'Events', 'Countries']" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get all sheet names\n", + "sheet_names = xls.sheet_names\n", + "sheet_names" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
#EventDateVenueLocationAttendanceRef.
0465UFC Fight Night: Assunção vs. Moraes 2Feb 2, 2019Centro de Formação Olímpica do NordesteFortaleza, Brazil10040[21]
1UFC 233Jan 26, 2019Honda CenterAnaheim, California, U.S.Cancelled[22]
2464UFC Fight Night: Cejudo vs. DillashawJan 19, 2019Barclays CenterBrooklyn, New York, U.S.12152[23]
3463UFC 232: Jones vs. Gustafsson 2Dec 29, 2018The ForumInglewood, California, U.S.15862[24]
4462UFC on Fox: Lee vs. Iaquinta 2Dec 15, 2018Fiserv ForumMilwaukee, Wisconsin, U.S.9010[25]
5461UFC 231: Holloway vs. OrtegaDec 8, 2018Scotiabank ArenaToronto, Ontario, Canada19039[26]
6460UFC Fight Night: dos Santos vs. TuivasaDec 2, 2018Adelaide Entertainment CentreAdelaide, Australia8652[27]
7459The Ultimate Fighter: Heavy Hitters FinaleNov 30, 2018Pearl TheatreLas Vegas, Nevada, U.S.2020[28]
8458UFC Fight Night: Blaydes vs. Ngannou 2Nov 24, 2018Cadillac ArenaBeijing, China10302[29]
9457UFC Fight Night: Magny vs. PonzinibbioNov 17, 2018Estadio Mary Terán de WeissBuenos Aires, Argentina10245[30]
\n", + "
" + ], + "text/plain": [ + " # Event Date \\\n", + "0 465 UFC Fight Night: Assunção vs. Moraes 2 Feb 2, 2019 \n", + "1 – UFC 233 Jan 26, 2019 \n", + "2 464 UFC Fight Night: Cejudo vs. Dillashaw Jan 19, 2019 \n", + "3 463 UFC 232: Jones vs. Gustafsson 2 Dec 29, 2018 \n", + "4 462 UFC on Fox: Lee vs. Iaquinta 2 Dec 15, 2018 \n", + "5 461 UFC 231: Holloway vs. Ortega Dec 8, 2018 \n", + "6 460 UFC Fight Night: dos Santos vs. Tuivasa Dec 2, 2018 \n", + "7 459 The Ultimate Fighter: Heavy Hitters Finale Nov 30, 2018 \n", + "8 458 UFC Fight Night: Blaydes vs. Ngannou 2 Nov 24, 2018 \n", + "9 457 UFC Fight Night: Magny vs. Ponzinibbio Nov 17, 2018 \n", + "\n", + " Venue Location \\\n", + "0 Centro de Formação Olímpica do Nordeste Fortaleza, Brazil \n", + "1 Honda Center Anaheim, California, U.S. \n", + "2 Barclays Center Brooklyn, New York, U.S. \n", + "3 The Forum Inglewood, California, U.S. \n", + "4 Fiserv Forum Milwaukee, Wisconsin, U.S. \n", + "5 Scotiabank Arena Toronto, Ontario, Canada \n", + "6 Adelaide Entertainment Centre Adelaide, Australia \n", + "7 Pearl Theatre Las Vegas, Nevada, U.S. \n", + "8 Cadillac Arena Beijing, China \n", + "9 Estadio Mary Terán de Weiss Buenos Aires, Argentina \n", + "\n", + " Attendance Ref. 
\n", + "0 10040 [21] \n", + "1 Cancelled [22] \n", + "2 12152 [23] \n", + "3 15862 [24] \n", + "4 9010 [25] \n", + "5 19039 [26] \n", + "6 8652 [27] \n", + "7 2020 [28] \n", + "8 10302 [29] \n", + "9 10245 [30] " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get information only for one sheet\n", + "df = pd.read_excel(xls, \"Events\")\n", + "df.head(10) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Working with many sheets" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "################################## People ##################################\n", + " OrderDate Country Region\n", + "0 1/6/2018 US East \n", + "1 1/23/2018 Brazil Central\n", + "2 2/9/2018 Congo Central\n", + "3 2/26/2018 Japan Central\n", + "4 3/15/2018 Germany West \n", + "################################## Events ##################################\n", + " # Event Date\n", + "0 465 UFC Fight Night: Assunção vs. Moraes 2 Feb 2, 2019 \n", + "1 – UFC 233 Jan 26, 2019\n", + "2 464 UFC Fight Night: Cejudo vs. Dillashaw Jan 19, 2019\n", + "3 463 UFC 232: Jones vs. Gustafsson 2 Dec 29, 2018\n", + "4 462 UFC on Fox: Lee vs.
Iaquinta 2 Dec 15, 2018\n", + "################################## Countries ##################################\n", + " 0 Rank Country\n", + "0 1 1.0 Russia* \n", + "1 2 2.0 China* \n", + "2 3 3.0 India \n", + "3 4 4.0 Kazakhstan* \n", + "4 5 5.0 Saudi Arabia\n" + ] + } + ], + "source": [ + "# read all sheets and extract first 5 rows, 3 columns\n", + "for tab in sheet_names:\n", + " print('################################## ' + tab + ' ##################################')\n", + " df = pd.read_excel(xls, tab)\n", + " print(df.iloc[:5, :3])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search one sheet, one column for a string" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>0</th><th>Rank</th><th>Country</th><th>Area (km²)</th><th>Notes</th><th>NaN</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>1</td><td>1.0</td><td>Russia*</td><td>13100000</td><td>17,125,200 including European part</td><td>NaN</td></tr>\n",
+ "    <tr><th>1</th><td>2</td><td>2.0</td><td>China*</td><td>9596961</td><td>excludes Hong Kong, Macau, Taiwan and disputed...</td><td>NaN</td></tr>\n",
+ "    <tr><th>2</th><td>3</td><td>3.0</td><td>India</td><td>3287263</td><td>NaN</td><td>NaN</td></tr>\n",
+ "    <tr><th>3</th><td>4</td><td>4.0</td><td>Kazakhstan*</td><td>2455034</td><td>2,724,902 km² including European part</td><td>NaN</td></tr>\n",
+ "    <tr><th>4</th><td>5</td><td>5.0</td><td>Saudi Arabia</td><td>2149690</td><td>NaN</td><td>NaN</td></tr>\n",
+ "    <tr><th>5</th><td>6</td><td>6.0</td><td>Iran</td><td>1648195</td><td>NaN</td><td>NaN</td></tr>\n",
+ "    <tr><th>6</th><td>7</td><td>7.0</td><td>Mongolia</td><td>1564110</td><td>NaN</td><td>NaN</td></tr>\n",
+ "    <tr><th>7</th><td>8</td><td>8.0</td><td>Indonesia*</td><td>1472639</td><td>1,904,569 km² including Oceanian part</td><td>NaN</td></tr>\n",
+ "    <tr><th>8</th><td>9</td><td>9.0</td><td>Pakistan</td><td>796095</td><td>882,363 km² including Gilgit-Baltistan and AJK</td><td>NaN</td></tr>\n",
+ "    <tr><th>9</th><td>10</td><td>10.0</td><td>Turkey*</td><td>747272</td><td>783,562 km² including European part</td><td>NaN</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " 0 Rank Country Area (km²) \\\n", + "0 1 1.0 Russia* 13100000 \n", + "1 2 2.0 China* 9596961 \n", + "2 3 3.0 India 3287263 \n", + "3 4 4.0 Kazakhstan* 2455034 \n", + "4 5 5.0 Saudi Arabia 2149690 \n", + "5 6 6.0 Iran 1648195 \n", + "6 7 7.0 Mongolia 1564110 \n", + "7 8 8.0 Indonesia* 1472639 \n", + "8 9 9.0 Pakistan 796095 \n", + "9 10 10.0 Turkey* 747272 \n", + "\n", + " Notes NaN \n", + "0 17,125,200 including European part NaN \n", + "1 excludes Hong Kong, Macau, Taiwan and disputed... NaN \n", + "2 NaN NaN \n", + "3 2,724,902 km² including European part NaN \n", + "4 NaN NaN \n", + "5 NaN NaN \n", + "6 NaN NaN \n", + "7 1,904,569 km² including Oceanian part NaN \n", + "8 882,363 km² including Gilgit-Baltistan and AJK NaN \n", + "9 783,562 km² including European part NaN " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_excel(xls, \"Countries\")\n", + "df.head(10) \n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>0</th><th>Rank</th><th>Country</th><th>Area (km²)</th><th>Notes</th><th>NaN</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>1</th><td>2</td><td>2.0</td><td>China*</td><td>9596961</td><td>excludes Hong Kong, Macau, Taiwan and disputed...</td><td>NaN</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " 0 Rank Country Area (km²) \\\n", + "1 2 2.0 China* 9596961 \n", + "\n", + " Notes NaN \n", + "1 excludes Hong Kong, Macau, Taiwan and disputed... NaN " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agg = df[df['Country'].str.contains('China', na=False)]\n", + "agg" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search in all sheets for a string" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "################################## People ##################################\n", + " Country\n", + "3 Japan \n", + "6 Japan \n", + "18 Japan \n", + "26 Japan \n", + "30 Japan \n", + "################################## Events ##################################\n", + " no column Country \n", + "################################## Countries ##################################\n", + " Country\n", + "17 Japan \n" + ] + } + ], + "source": [ + "# search in every sheet in column Country for word 'Japan'\n", + "# print out message if the column is missing\n", + "for tab in sheet_names:\n", + " print('################################## ' + tab + ' ##################################')\n", + " df = pd.read_excel(xls, tab)\n", + " try:\n", + " agg = df[df['Country'].str.contains('Japan', na=False)]\n", + " print(agg[['Country']])\n", + " except KeyError:\n", + " print(' no column Country ')" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "################################## People ##################################\n", + " Country\n", + "14 China \n", + "16 China \n", + "22 China \n", + "32 China \n", + "33 China \n", + "################################## Events ##################################\n", + " no tab Country \n", + 
"################################## Countries ##################################\n", + " Country\n", + "1 China*\n" + ] + } + ], + "source": [ + "# search for a partial match\n", + "for tab in sheet_names:\n", + " print('################################## ' + tab + ' ##################################')\n", + " df = pd.read_excel(xls, tab)\n", + " try:\n", + " agg = df[df['Country'].str.contains('China', na=False)]\n", + " print(agg[['Country']])\n", + " except KeyError:\n", + " print(' no tab Country ')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "################################## People ##################################\n", + " Country\n", + "14 China \n", + "16 China \n", + "22 China \n", + "32 China \n", + "33 China \n", + "################################## Events ##################################\n", + " no tab Country \n", + "################################## Countries ##################################\n", + "Empty DataFrame\n", + "Columns: [Country]\n", + "Index: []\n" + ] + } + ], + "source": [ + "# search for a exact match\n", + "for tab in sheet_names:\n", + " print('################################## ' + tab + ' ##################################')\n", + " df = pd.read_excel(xls, tab)\n", + " try:\n", + " agg = df[df['Country'] == 'China']\n", + " print(agg[['Country']])\n", + " except KeyError:\n", + " print(' no tab Country ')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + 
"nbformat_minor": 2 +} diff --git a/notebooks/IPython tricks 2019.ipynb b/notebooks/IPython tricks 2019.ipynb new file mode 100644 index 0000000..2ce6b1e --- /dev/null +++ b/notebooks/IPython tricks 2019.ipynb @@ -0,0 +1,535 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# IPython/Jupyter Notebook tricks for advanced in 2019" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Suppress output in IPython Notebook " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "simple" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "2*2" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "2*2;" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "function" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Private Message\n" + ] + } + ], + "source": [ + "def myfunc():\n", + " print('Private Message')\n", + "myfunc();" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "def myfunc():\n", + " print('Private Message')\n", + "myfunc()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "function 2" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Private Message\n" + ] + } + ], + "source": [ + "def myfunc():\n", + " print('Private Message')\n", + " \n", + "myfunc()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.utils import io\n", + 
"\n", + "def myfunc():\n", + " print('Private Message')\n", + "\n", + "with io.capture_output() as captured:\n", + " myfunc()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get function docs and arguments IPython Notebook " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy\n", + "table_list = [1,2,3,4,4]\n", + "l = numpy.array_split(table_list, len(table_list)/4)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "? numpy.array_split" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Change theme IPython Notebook \n", + "\n", + "install the module by\n", + "\n", + "`pip install jupyterthemes`\n", + "\n", + "install a theme:\n", + "\n", + "`jt -t chesterish`\n", + "\n", + "restore a theme:\n", + "\n", + "`jt -r`\n", + "\n", + "It can be done even inside jupyter notebook by:\n", + "\n", + "`!jt -r`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!jt -r" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!jt -t chesterish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus: some useful jupyter notebook magics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!jupyter kernelspec list" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy\n", + "print (numpy.__path__)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python 3.6.7\r\n" + ] + } + ], + 
"source": [ + "!python -V" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!which python" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "appdirs==1.4.3\r\n", + "asn1crypto==0.24.0\r\n", + "atomicwrites==1.2.1\r\n", + "attrs==18.2.0\r\n", + "backcall==0.1.0\r\n", + "black==18.9b0\r\n", + "bleach==3.0.2\r\n", + "boto==2.49.0\r\n", + "boto3==1.9.67\r\n", + "botocore==1.12.67\r\n", + "camelot-py==0.7.1\r\n", + "certifi==2018.8.24\r\n", + "cffi==1.11.5\r\n", + "chardet==3.0.4\r\n", + "Click==7.0\r\n", + "cryptography==2.3.1\r\n", + "cycler==0.10.0\r\n", + "decorator==4.3.0\r\n", + "defusedxml==0.5.0\r\n", + "distro==1.3.0\r\n", + "docutils==0.14\r\n", + "entrypoints==0.2.3\r\n", + "et-xmlfile==1.0.1\r\n", + "filelock==3.0.10\r\n", + "idna==2.7\r\n", + "ipykernel==5.1.0\r\n", + "ipython==7.2.0\r\n", + "ipython-genutils==0.2.0\r\n", + "ipywidgets==7.4.2\r\n", + "jdcal==1.4\r\n", + "jedi==0.13.1\r\n", + "Jinja2==2.10\r\n", + "jira==2.0.0\r\n", + "jmespath==0.9.3\r\n", + "jsonref==0.2\r\n", + "jsonschema==2.6.0\r\n", + "jupyter==1.0.0\r\n", + "jupyter-client==5.2.3\r\n", + "jupyter-console==6.0.0\r\n", + "jupyter-core==4.4.0\r\n", + "jupyterthemes==0.20.0\r\n", + "kiwisolver==1.0.1\r\n", + "lesscpy==0.13.0\r\n", + "lxml==4.3.0\r\n", + "MarkupSafe==1.1.0\r\n", + "matplotlib==3.0.0\r\n", + "mistune==0.8.4\r\n", + "more-itertools==5.0.0\r\n", + "nbconvert==5.4.0\r\n", + "nbformat==4.4.0\r\n", + "notebook==5.7.2\r\n", + "numpy==1.15.1\r\n", + "oauthlib==2.1.0\r\n", + "opencv-python==4.0.0.21\r\n", + "openpyxl==2.5.14\r\n", + "packaging==16.8\r\n", + "pandas==0.23.4\r\n", + "pandocfilters==1.4.2\r\n", + "parso==0.3.1\r\n", + "pbr==4.2.0\r\n", + "pdfminer.six==20181108\r\n", + "pexpect==4.6.0\r\n", + "pickleshare==0.7.5\r\n", + "Pillow==5.2.0\r\n", + "pkg-resources==0.0.0\r\n", + "pluggy==0.8.1\r\n", + 
"ply==3.11\r\n", + "prometheus-client==0.4.2\r\n", + "prompt-toolkit==2.0.7\r\n", + "ptyprocess==0.6.0\r\n", + "py==1.7.0\r\n", + "py-spy==0.1.8\r\n", + "pycodestyle==2.3.1\r\n", + "pycparser==2.18\r\n", + "pycryptodome==3.7.3\r\n", + "Pygments==2.3.0\r\n", + "PyJWT==1.6.4\r\n", + "PyMySQL==0.9.2\r\n", + "pyparsing==2.2.0\r\n", + "PyPDF2==1.26.0\r\n", + "pytesseract==0.2.4\r\n", + "pytest==4.1.1\r\n", + "python-dateutil==2.7.3\r\n", + "pytz==2018.5\r\n", + "pyzmq==17.1.2\r\n", + "qtconsole==4.4.3\r\n", + "requests==2.19.1\r\n", + "requests-oauthlib==1.0.0\r\n", + "requests-toolbelt==0.8.0\r\n", + "retrying==1.3.3\r\n", + "s3transfer==0.1.13\r\n", + "scrapinghub==2.0.3\r\n", + "selenium==3.14.0\r\n", + "Send2Trash==1.5.0\r\n", + "simplejson==3.10.0\r\n", + "six==1.10.0\r\n", + "sortedcontainers==2.1.0\r\n", + "style==1.1.0\r\n", + "tabula-py==1.3.1\r\n", + "tabulate==0.8.2\r\n", + "terminado==0.8.1\r\n", + "testpath==0.4.2\r\n", + "toml==0.10.0\r\n", + "tornado==5.1.1\r\n", + "tox==3.7.0\r\n", + "traitlets==4.3.2\r\n", + "update==0.0.1\r\n", + "urllib3==1.23\r\n", + "virtualenv==16.3.0\r\n", + "Wand==0.4.4\r\n", + "wcwidth==0.1.7\r\n", + "webencodings==0.5.1\r\n", + "widgetsnbextension==3.4.2\r\n" + ] + } + ], + "source": [ + "!pip freeze" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!echo $PATH " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus 2: Top 10 most useful ipython key shortcuts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* Shift + Enter - \trun cell\n", + "* Alt + Enter - \trun cell, insert below\n", + "* Ctrl + m, c - \tcopy cell\n", + "* Ctrl + m, v - \tpaste cell\n", + "* Ctrl + m, l - \ttoggle line numbers\n", + "* Ctrl + m, j -\tmove cell\n", + "* Ctrl + m, y -\tcode cell\n", + "* Ctrl + m, m -\tmarkdown cell\n", + "* Ctrl + m, . 
-\trestart kernel\n", + "* Ctrl + m, h -\tshow keyboard shortcuts" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "1+1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "1+1\n", + "## markdown" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Image_validation_with_Python.ipynb b/notebooks/Image_validation_with_Python.ipynb new file mode 100644 index 0000000..c7e9e85 --- /dev/null +++ b/notebooks/Image_validation_with_Python.ipynb @@ -0,0 
+1,423 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Image validation with Python\n", + "\n", + "* is a file valid image\n", + " * check file extension\n", + " * check the file with pil\n", + "* is the image blank\n", + "* is the image contains a pattern\n", + "\n", + "#### possible future video:\n", + "* multiple image validation\n", + "* validation url image without donwload\n", + "* search image in image" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## is a file valid image" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check file extension\n", + "test_img = './csv/movie_metadata.csv'\n", + "test_img.lower().endswith(('.png', '.jpg', '.jpeg'))" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check file extension\n", + "test_img = './csv/Selection_001.png'\n", + "test_img.lower().endswith(('.png', '.jpg', '.jpeg'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### check the file with pil\n", + "\n", + "`pip install Pillow`" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "from PIL import Image\n", + "def is_jpg(filename):\n", + " try:\n", + " i=Image.open(filename)\n", + " return i.format in ['PNG', 'JPEG']\n", + " except IOError:\n", + " return False\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ 
+ "is_jpg('./csv') " + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_jpg('./csv/movie_metadata.csv') " + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_jpg('./csv/Selection_001.png') " + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_jpg('./csv/Selection_001.png') " + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_jpg('./csv/fire-and-water-2354583_960_720.jpg') " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## is the image blank" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "None\n" + ] + } + ], + "source": [ + "import json\n", + "from io import BytesIO\n", + "from PIL import Image\n", + "import requests\n", + "\n", + "remote_file = 'https://cdn.pixabay.com/photo/2013/03/29/07/34/girl-97433_960_720.jpg'\n", + "\n", + "response = requests.get(remote_file)\n", + "img = Image.open(BytesIO(response.content))\n", + "\n", + "clrs = img.getcolors()\n", + "print(clrs)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": 
[ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import Image\n", + "from IPython.core.display import HTML \n", + "\n", + "color_image = './csv/Selection_139.png'\n", + "\n", + "Image(url= color_image)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "None\n" + ] + } + ], + "source": [ + "import json\n", + "from io import BytesIO\n", + "from PIL import Image\n", + "import requests\n", + "\n", + "img = Image.open(color_image)\n", + "\n", + "clrs = img.getcolors()\n", + "print(clrs)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import Image\n", + "from IPython.core.display import HTML \n", + "\n", + "blank_image = './csv/Selection_140.png'\n", + "\n", + "Image(url= blank_image)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[(49128, (238, 238, 238))]\n" + ] + } + ], + "source": [ + "import json\n", + "from io import BytesIO\n", + "from PIL import Image\n", + "import requests\n", + "\n", + "img = Image.open(blank_image)\n", + "\n", + "clrs = img.getcolors()\n", + "print(clrs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## is the image contains a pattern" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Drawing\"\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import cv2\n", + "import numpy as np\n", + "\n", + 
"img_rgb = cv2.imread('./csv/image_with_coin.jpg')\n", + "template = cv2.imread('./csv/coin.png')\n", + "w, h = template.shape[:-1]\n", + "\n", + "res = cv2.matchTemplate(img_rgb, template, cv2.TM_CCOEFF_NORMED)\n", + "threshold = .8\n", + "loc = np.where(res >= threshold)\n", + "for pt in zip(*loc[::-1]): \n", + " cv2.rectangle(img_rgb, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)\n", + "\n", + "cv2.imwrite('./csv/result.png', img_rgb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import Image\n", + "from IPython.core.display import HTML \n", + "\n", + "Image(url= './csv/result.png')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Load_multiple_CSV_files_into_a_single _Dataframe.ipynb b/notebooks/Load_multiple_CSV_files_into_a_single _Dataframe.ipynb new file mode 100644 index 0000000..5b9eccf --- /dev/null +++ b/notebooks/Load_multiple_CSV_files_into_a_single _Dataframe.ipynb @@ -0,0 +1,456 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', -1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Rename multiple CSV files in a folder with Python" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import glob, os\n", + "\n", + "def rename(dir, 
pathAndFilename, pattern, titlePattern):\n", + " os.rename(pathAndFilename, os.path.join(dir, titlePattern))\n", + "\n", + "# search for csv files in the working folder \n", + "path = os.path.expanduser(\"~/Projects/MYP/Datasets/test/*.csv\")\n", + "\n", + "# iterate and rename them one by one with the number of the iteration\n", + "for i, fname in enumerate(glob.glob(path)):\n", + " rename(os.path.expanduser('~/Projects/MYP/Datasets/test/'), fname, r'*.csv', r'test{}.csv'.format(i))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load several files into Dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(541, 7)\n", + "(550, 7)\n", + "(1641, 7)\n" + ] + } + ], + "source": [ + "# change separator for CSV file\n", + "df1 = pd.read_csv('~/Projects/MYP/Datasets/test/test0.csv', sep=\"@\")\n", + "df2 = pd.read_csv('~/Projects/MYP/Datasets/test/test1.csv', sep=\"@\")\n", + "df3 = pd.read_csv('~/Projects/MYP/Datasets/test/test1.csv', sep=\"@\")\n", + "\n", + "frames = [df1, df2, df3]\n", + "\n", + "# concatenate multiple data CSV files\n", + "all = pd.concat(frames)\n", + "\n", + "print(df1.shape)\n", + "print(df2.shape)\n", + "print(all.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dynamically Load multiple csv file into Dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>title</th><th>Views</th><th>Like</th><th>Dislike</th><th>Comment</th><th>channel</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>215</th><td>Turning Google Earth into SimCity 2000</td><td>168175.0</td><td>3251.0</td><td>1125.0</td><td>215.0</td><td>test0.csv</td></tr>\n",
+ "    <tr><th>301</th><td>Microservices + Events + Docker = A Perfect Trio</td><td>161110.0</td><td>3213.0</td><td>50.0</td><td>83.0</td><td>test0.csv</td></tr>\n",
+ "    <tr><th>265</th><td>PHP in 2018 by the Creator of PHP</td><td>164577.0</td><td>3557.0</td><td>69.0</td><td>384.0</td><td>test0.csv</td></tr>\n",
+ "    <tr><th>468</th><td>Developing Blockchain Software</td><td>169484.0</td><td>2512.0</td><td>116.0</td><td>133.0</td><td>test0.csv</td></tr>\n",
+ "    <tr><th>398</th><td>VS Code: The Last Editor You'll Ever Need</td><td>172738.0</td><td>1930.0</td><td>194.0</td><td>340.0</td><td>test0.csv</td></tr>\n",
+ "    <tr><th>175</th><td>Coding Challenge #74: Clock with p5.js</td><td>232227.0</td><td>4609.0</td><td>68.0</td><td>289.0</td><td>test1.csv</td></tr>\n",
+ "    <tr><th>373</th><td>Coding Challenge #12: The Lorenz Attractor in Processing</td><td>217172.0</td><td>3680.0</td><td>43.0</td><td>333.0</td><td>test1.csv</td></tr>\n",
+ "    <tr><th>447</th><td>10.4: Loading JSON data from a URL (Asynchronous Callbacks!) - p5.js Tutorial</td><td>218081.0</td><td>2120.0</td><td>79.0</td><td>240.0</td><td>test1.csv</td></tr>\n",
+ "    <tr><th>269</th><td>The Coding Train!</td><td>218635.0</td><td>2482.0</td><td>83.0</td><td>324.0</td><td>test1.csv</td></tr>\n",
+ "    <tr><th>193</th><td>Coding Challenge #71: Minesweeper</td><td>220816.0</td><td>3334.0</td><td>71.0</td><td>401.0</td><td>test1.csv</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " title \\\n", + "215 Turning Google Earth into SimCity 2000 \n", + "301 Microservices + Events + Docker = A Perfect Trio \n", + "265 PHP in 2018 by the Creator of PHP \n", + "468 Developing Blockchain Software \n", + "398 VS Code: The Last Editor You'll Ever Need \n", + "175 Coding Challenge #74: Clock with p5.js \n", + "373 Coding Challenge #12: The Lorenz Attractor in Processing \n", + "447 10.4: Loading JSON data from a URL (Asynchronous Callbacks!) - p5.js Tutorial \n", + "269 The Coding Train! \n", + "193 Coding Challenge #71: Minesweeper \n", + "\n", + " Views Like Dislike Comment channel \n", + "215 168175.0 3251.0 1125.0 215.0 test0.csv \n", + "301 161110.0 3213.0 50.0 83.0 test0.csv \n", + "265 164577.0 3557.0 69.0 384.0 test0.csv \n", + "468 169484.0 2512.0 116.0 133.0 test0.csv \n", + "398 172738.0 1930.0 194.0 340.0 test0.csv \n", + "175 232227.0 4609.0 68.0 289.0 test1.csv \n", + "373 217172.0 3680.0 43.0 333.0 test1.csv \n", + "447 218081.0 2120.0 79.0 240.0 test1.csv \n", + "269 218635.0 2482.0 83.0 324.0 test1.csv \n", + "193 220816.0 3334.0 71.0 401.0 test1.csv " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import glob\n", + "\n", + "result = pd.DataFrame()\n", + "\n", + "path = os.path.expanduser(\"~/Projects/MYP/Datasets/test/*.csv\")\n", + "\n", + "for fname in glob.glob(path):\n", + " head, tail = os.path.split(fname)\n", + " df = pd.read_csv(fname, sep=\"@\")\n", + " df2 = df.sort_values(by=['Views'], ascending=False).drop(['Favorite', 'videoID'], axis=1).iloc[15:20,:]\n", + " df2['channel'] = tail\n", + " result = pd.concat([result, df2])\n", + "result.sort_values(by=['channel']).iloc[0:10,] " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generate clickable links with pandas and Jupyter notebook" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": 
[ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titleViewsLikeDislikeFavoriteCommentvideoIDnameurl
20How To...80620.0121.013.00.013.0https:...\n", + "
21How To...165533.0432.0143.00.017.0https:...\n", + "
22How To...29636.099.016.00.08.0https:...\n", + "
23How to...409.04.00.00.00.0https:...\n", + "
24How to...31358.059.033.00.02.0https:...\n", + "
25How To...85887.0272.076.00.04.0https:...\n", + "
26How To...61449.095.034.00.00.0https:...\n", + "
27How To...262342.01440.093.00.0447.0https:...\n", + "
28How To...154661.0453.0122.00.011.0https:...\n", + "
29How To...109787.0257.040.00.022.0https:...\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "# convert url column into href tag and add it as a new column to dataframe\n", + "df['nameurl'] = df['videoID'].apply(lambda x: 'XXXXX'.format(x))\n", + "\n", + "\n", + "\n", + "# otherwise the link will be blank\n", + "pd.set_option('display.max_colwidth', 10)\n", + "\n", + "# in order to display HTML code\n", + "HTML(df.iloc[20:30,] .to_html(escape=False))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/notebooks/Pandas count and percentage by value for a column.ipynb b/notebooks/Pandas count and percentage by value for a column.ipynb new file mode 100644 index 0000000..b31455a --- /dev/null +++ b/notebooks/Pandas count and percentage by value for a column.ipynb @@ -0,0 +1,328 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas count and percentage by value for a column\n", + "\n", + "* read remote data from pdf\n", + "* calculate count and percent\n", + "* format percent in better output\n", + "\n", + "Bonus\n", + "\n", + "* pandas column renaming" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
foodPortion sizeper 100 gramsenergy
0Fish cake90 cals per cake200 calsMedium
1Fish fingers50 cals per piece220 calsMedium
2Gammon320 cals280 calsMed-High
3Haddock fresh200 cals110 calsLow calorie
4Halibut fresh220 cals125 calsLow calorie
\n", + "
" + ], + "text/plain": [ + " food Portion size per 100 grams energy\n", + "0 Fish cake 90 cals per cake 200 cals Medium\n", + "1 Fish fingers 50 cals per piece 220 cals Medium\n", + "2 Gammon 320 cals 280 cals Med-High\n", + "3 Haddock fresh 200 cals 110 cals Low calorie\n", + "4 Halibut fresh 220 cals 125 cals Low calorie" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from tabula import read_pdf\n", + "import pandas as pd\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=3, pandas_options={'header': None})\n", + "df.columns = ['food', 'Portion size ', 'per 100 grams', 'energy']\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "s = df.energy" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Medium 14\n", + "High 6\n", + "Low calorie 4\n", + "Med-High 4\n", + "Low-Med 1\n", + "Low- Med 1\n", + "Name: energy, dtype: int64" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "counts = s.value_counts()\n", + "counts" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Medium 0.466667\n", + "High 0.200000\n", + "Low calorie 0.133333\n", + "Med-High 0.133333\n", + "Low-Med 0.033333\n", + "Low- Med 0.033333\n", + "Name: energy, dtype: float64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "percent = s.value_counts(normalize=True)\n", + "percent" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Medium 46.7%\n", + "High 20.0%\n", + "Low calorie 13.3%\n", + "Med-High 13.3%\n", + "Low-Med 3.3%\n", + "Low- Med 3.3%\n", + 
"Name: energy, dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "percent100 = s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'\n", + "percent100" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
countsperper100
Medium140.46666746.7%
High60.20000020.0%
Low calorie40.13333313.3%
Med-High40.13333313.3%
Low-Med10.0333333.3%
Low- Med10.0333333.3%
\n", + "
" + ], + "text/plain": [ + " counts per per100\n", + "Medium 14 0.466667 46.7%\n", + "High 6 0.200000 20.0%\n", + "Low calorie 4 0.133333 13.3%\n", + "Med-High 4 0.133333 13.3%\n", + "Low-Med 1 0.033333 3.3%\n", + "Low- Med 1 0.033333 3.3%" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame({'counts': counts, 'per': percent, 'per100': percent100})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = df.energy\n", + "counts = s.value_counts()\n", + "percent = s.value_counts(normalize=True)\n", + "percent100 = s.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'\n", + "pd.DataFrame({'counts': counts, 'per': percent, 'per100': percent100})" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Pandas is column is contained in another column in the same row.ipynb b/notebooks/Pandas is column is contained in another column in the same row.ipynb new file mode 100644 index 0000000..0ca1dff --- /dev/null +++ b/notebooks/Pandas is column is contained in another column in the same row.ipynb @@ -0,0 +1,3029 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas is column is contained in another column in the same row\n", + "\n", + "dataset:\n", + "\n", + "https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset#movie_metadata.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_namenum_critic_for_reviewsduration
0James Cameron723.0178.0
1Gore Verbinski302.0169.0
2Sam Mendes602.0148.0
3Christopher Nolan813.0164.0
4Doug WalkerNaNNaN
\n", + "
" + ], + "text/plain": [ + " director_name num_critic_for_reviews duration\n", + "0 James Cameron 723.0 178.0\n", + "1 Gore Verbinski 302.0 169.0\n", + "2 Sam Mendes 602.0 148.0\n", + "3 Christopher Nolan 813.0 164.0\n", + "4 Doug Walker NaN NaN" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read a dataset movies\n", + "import pandas as pd\n", + "movies = pd.read_csv('./csv/movie_metadata.csv', usecols=[1,2,3])\n", + "movies.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_nameactor_2_namegrossgenresmovie_titleplot_keywordscontent_rating
5038Scott SmithDaphne ZunigaNaNComedy|DramaSigned Sealed Deliveredfraud|postal worker|prison|theft|trialNaN
5039NaNValorie CurryNaNCrime|Drama|Mystery|ThrillerThe Followingcult|fbi|hideout|prison escape|serial killerTV-14
5040Benjamin RoberdsMaxwell MoodyNaNDrama|Horror|ThrillerA Plague So PleasantNaNNaN
5041Daniel HsiaDaniel Henney10443.0Comedy|Drama|RomanceShanghai CallingNaNPG-13
5042Jon GunnBrian Herzlinger85222.0DocumentaryMy Date with Drewactress name in title|crush|date|four word tit...PG
\n", + "
" + ], + "text/plain": [ + " director_name actor_2_name gross \\\n", + "5038 Scott Smith Daphne Zuniga NaN \n", + "5039 NaN Valorie Curry NaN \n", + "5040 Benjamin Roberds Maxwell Moody NaN \n", + "5041 Daniel Hsia Daniel Henney 10443.0 \n", + "5042 Jon Gunn Brian Herzlinger 85222.0 \n", + "\n", + " genres movie_title \\\n", + "5038 Comedy|Drama Signed Sealed Delivered  \n", + "5039 Crime|Drama|Mystery|Thriller The Following  \n", + "5040 Drama|Horror|Thriller A Plague So Pleasant  \n", + "5041 Comedy|Drama|Romance Shanghai Calling  \n", + "5042 Documentary My Date with Drew  \n", + "\n", + " plot_keywords content_rating \n", + "5038 fraud|postal worker|prison|theft|trial NaN \n", + "5039 cult|fbi|hideout|prison escape|serial killer TV-14 \n", + "5040 NaN NaN \n", + "5041 NaN PG-13 \n", + "5042 actress name in title|crush|date|four word tit... PG " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# read a dataset movies\n", + "import pandas as pd\n", + "movies = pd.read_csv('./csv/movie_metadata.csv', usecols=['movie_title', 'director_name', 'actor_2_name', 'content_rating','plot_keywords','gross', 'genres'])\n", + "movies.tail()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Compare if two columns match" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titledirector_nameactor_2_name
437The ExpendablesSylvester StalloneSylvester Stallone
504OceansJacques PerrinJacques Perrin
600Star Trek: InsurrectionJonathan FrakesJonathan Frakes
931TedSeth MacFarlaneSeth MacFarlane
1057Dick TracyWarren BeattyWarren Beatty
\n", + "
" + ], + "text/plain": [ + " movie_title director_name actor_2_name\n", + "437 The Expendables  Sylvester Stallone Sylvester Stallone\n", + "504 Oceans  Jacques Perrin Jacques Perrin\n", + "600 Star Trek: Insurrection  Jonathan Frakes Jonathan Frakes\n", + "931 Ted  Seth MacFarlane Seth MacFarlane\n", + "1057 Dick Tracy  Warren Beatty Warren Beatty" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check if two columns in a single row are identical\n", + "df2 = movies.loc[(movies.director_name == movies.actor_2_name), ['movie_title', 'director_name', 'actor_2_name']]\n", + "df2.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filter on two conditions" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_nameactor_2_namegrossgenresmovie_titleplot_keywordscontent_rating
1Gore VerbinskiOrlando Bloom309404152.0Action|Adventure|FantasyPirates of the Caribbean: At World's Endgoddess|marriage ceremony|marriage proposal|pi...PG-13
13Gore VerbinskiOrlando Bloom423032628.0Action|Adventure|FantasyPirates of the Caribbean: Dead Man's Chestbox office hit|giant squid|heart|liar's dice|m...PG-13
18Rob MarshallSam Claflin241063875.0Action|Adventure|FantasyPirates of the Caribbean: On Stranger Tidesblackbeard|captain|pirate|revenge|soldierPG-13
21Marc WebbAndrew Garfield262030663.0Action|Adventure|FantasyThe Amazing Spider-Manlizard|outcast|spider|spider man|teenagerPG-13
54Steven SpielbergRay Winstone317011114.0Action|Adventure|FantasyIndiana Jones and the Kingdom of the Crystal S...cult figure|femme fatale|indiana jones|unsubti...PG-13
\n", + "
" + ], + "text/plain": [ + " director_name actor_2_name gross genres \\\n", + "1 Gore Verbinski Orlando Bloom 309404152.0 Action|Adventure|Fantasy \n", + "13 Gore Verbinski Orlando Bloom 423032628.0 Action|Adventure|Fantasy \n", + "18 Rob Marshall Sam Claflin 241063875.0 Action|Adventure|Fantasy \n", + "21 Marc Webb Andrew Garfield 262030663.0 Action|Adventure|Fantasy \n", + "54 Steven Spielberg Ray Winstone 317011114.0 Action|Adventure|Fantasy \n", + "\n", + " movie_title \\\n", + "1 Pirates of the Caribbean: At World's End  \n", + "13 Pirates of the Caribbean: Dead Man's Chest  \n", + "18 Pirates of the Caribbean: On Stranger Tides  \n", + "21 The Amazing Spider-Man  \n", + "54 Indiana Jones and the Kingdom of the Crystal S... \n", + "\n", + " plot_keywords content_rating \n", + "1 goddess|marriage ceremony|marriage proposal|pi... PG-13 \n", + "13 box office hit|giant squid|heart|liar's dice|m... PG-13 \n", + "18 blackbeard|captain|pirate|revenge|soldier PG-13 \n", + "21 lizard|outcast|spider|spider man|teenager PG-13 \n", + "54 cult figure|femme fatale|indiana jones|unsubti... PG-13 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# several condtitions can be applied with loc\n", + "df2 = movies.loc[(movies.gross > 100000000.0) & (movies.genres == 'Action|Adventure|Fantasy'), :]\n", + "df2.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Search if a column is part of another column on the same row" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "movies['rat'] = movies.content_rating.astype(str)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_nameactor_2_namegrossgenresmovie_titleplot_keywordscontent_ratingrat
0James CameronJoel David Moore760505847.0Action|Adventure|Fantasy|Sci-FiAvataravatar|future|marine|native|paraplegicPG-13PG-13
1Gore VerbinskiOrlando Bloom309404152.0Action|Adventure|FantasyPirates of the Caribbean: At World's Endgoddess|marriage ceremony|marriage proposal|pi...PG-13PG-13
2Sam MendesRory Kinnear200074175.0Action|Adventure|ThrillerSpectrebomb|espionage|sequel|spy|terroristPG-13PG-13
3Christopher NolanChristian Bale448130642.0Action|ThrillerThe Dark Knight Risesdeception|imprisonment|lawlessness|police offi...PG-13PG-13
4Doug WalkerRob WalkerNaNDocumentaryStar Wars: Episode VII - The Force Awakens  ...NaNNaNnan
\n", + "
" + ], + "text/plain": [ + " director_name actor_2_name gross \\\n", + "0 James Cameron Joel David Moore 760505847.0 \n", + "1 Gore Verbinski Orlando Bloom 309404152.0 \n", + "2 Sam Mendes Rory Kinnear 200074175.0 \n", + "3 Christopher Nolan Christian Bale 448130642.0 \n", + "4 Doug Walker Rob Walker NaN \n", + "\n", + " genres \\\n", + "0 Action|Adventure|Fantasy|Sci-Fi \n", + "1 Action|Adventure|Fantasy \n", + "2 Action|Adventure|Thriller \n", + "3 Action|Thriller \n", + "4 Documentary \n", + "\n", + " movie_title \\\n", + "0 Avatar  \n", + "1 Pirates of the Caribbean: At World's End  \n", + "2 Spectre  \n", + "3 The Dark Knight Rises  \n", + "4 Star Wars: Episode VII - The Force Awakens  ... \n", + "\n", + " plot_keywords content_rating rat \n", + "0 avatar|future|marine|native|paraplegic PG-13 PG-13 \n", + "1 goddess|marriage ceremony|marriage proposal|pi... PG-13 PG-13 \n", + "2 bomb|espionage|sequel|spy|terrorist PG-13 PG-13 \n", + "3 deception|imprisonment|lawlessness|police offi... PG-13 PG-13 \n", + "4 NaN NaN nan " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using lambda expression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* **Axis 0** iterate on all the ROWS in each COLUMN\n", + "* **Axis 1** iterate on all the COLUMNS in each ROW" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
content_ratingmovie_title
94RTerminator 3: Rise of the Machines
124RThe Matrix Revolutions
126RThe Matrix Reloaded
128RMad Max: Fury Road
179RThe Revenant
\n", + "
" + ], + "text/plain": [ + " content_rating movie_title\n", + "94 R Terminator 3: Rise of the Machines \n", + "124 R The Matrix Revolutions \n", + "126 R The Matrix Reloaded \n", + "128 R Mad Max: Fury Road \n", + "179 R The Revenant " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_filter = movies[movies.apply(lambda row: row.rat in row.movie_title, axis=1)]\n", + "df_filter[['content_rating','movie_title']].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
content_ratingmovie_title
94RTerminator 3: Rise of the Machines
124RThe Matrix Revolutions
126RThe Matrix Reloaded
128RMad Max: Fury Road
179RThe Revenant
\n", + "
" + ], + "text/plain": [ + " content_rating movie_title\n", + "94 R Terminator 3: Rise of the Machines \n", + "124 R The Matrix Revolutions \n", + "126 R The Matrix Reloaded \n", + "128 R Mad Max: Fury Road \n", + "179 R The Revenant " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_filter = movies[movies.apply(lambda row: row.movie_title.find(row.rat) != -1, axis=1)]\n", + "df_filter[['content_rating','movie_title']].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "14" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'Terminator 3: Rise of the Machines'.find('R')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "-1" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'Terminator 3: Rise of the Machines'.find('A')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'Terminator 3: Rise of the Machines'.find('A') != -1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using for loop and series to search" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "rating_list = movies['content_rating']\n", + "title_list = movies['movie_title']" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[False, False, False, False, False]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_part = []\n", + "for i, rate in enumerate(rating_list):\n", + " if not pd.isna(title_list[i]) and (str(rate) in title_list[i]):\n", + " 
is_part.append(True)\n", + " else:\n", + " is_part.append(False)\n", + "is_part[:5] " + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
content_ratingmovie_title
94RTerminator 3: Rise of the Machines
124RThe Matrix Revolutions
126RThe Matrix Reloaded
128RMad Max: Fury Road
179RThe Revenant
\n", + "
" + ], + "text/plain": [ + " content_rating movie_title\n", + "94 R Terminator 3: Rise of the Machines \n", + "124 R The Matrix Revolutions \n", + "126 R The Matrix Reloaded \n", + "128 R Mad Max: Fury Road \n", + "179 R The Revenant " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[is_part][['content_rating','movie_title']].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## search part of a column it it contained in another column?" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 avatar\n", + "1 goddess\n", + "2 bomb\n", + "3 deception\n", + "4 NaN\n", + "Name: plot_keywords, dtype: object" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "keywords_list = movies['plot_keywords']\n", + "title_list = movies['movie_title']\n", + "keys_list = movies['plot_keywords'].str.split('|').str.get(0)\n", + "keys_list[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 Avatar \n", + "1 Pirates of the Caribbean: At World's End \n", + "2 Spectre \n", + "3 The Dark Knight Rises \n", + "4 Star Wars: Episode VII - The Force Awakens  ...\n", + "Name: movie_title, dtype: object" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "title_list[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[False, False, False, False, False]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_part = []\n", + "for i, key in enumerate(keys_list):\n", + " if not pd.isna(title_list[i]) and (str(key) in title_list[i]):\n", + " is_part.append(True)\n", + " else:\n", + 
" is_part.append(False)\n", + "is_part[:5] " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
plot_keywordsmovie_title
620ball|blood|skating|song|year 2005Rollerball
989boy|bully|dream|dream sequence|planetThe Adventures of Sharkboy and Lavagirl 3-D
1767cop|corrupt politician|future|senator|time travelTimecop
4185ape|fear|future|spacecraft|spaceshipEscape from the Planet of the Apes
\n", + "
" + ], + "text/plain": [ + " plot_keywords \\\n", + "620 ball|blood|skating|song|year 2005 \n", + "989 boy|bully|dream|dream sequence|planet \n", + "1767 cop|corrupt politician|future|senator|time travel \n", + "4185 ape|fear|future|spacecraft|spaceship \n", + "\n", + " movie_title \n", + "620 Rollerball  \n", + "989 The Adventures of Sharkboy and Lavagirl 3-D  \n", + "1767 Timecop  \n", + "4185 Escape from the Planet of the Apes  " + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[is_part][['plot_keywords','movie_title']]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", 
+ " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " 
False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " False,\n", + " ...]" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "is_part = []\n", + "for i, key in enumerate(keys_list):\n", + " if not pd.isna(title_list[i]) and (str(key) in title_list[i] or 
str(key).lower() == title_list[i].strip().lower()):\n", + "     is_part.append(True)\n", + " else:\n", + "     is_part.append(False)\n", + "is_part" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>plot_keywords</th><th>movie_title</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>avatar|future|marine|native|paraplegic</td><td>Avatar</td></tr>\n",
+ " <tr><th>33</th><td>alice in wonderland|mistaking reality for drea...</td><td>Alice in Wonderland</td></tr>\n",
+ " <tr><th>196</th><td>australia|cattle|darwin|drover|japanese</td><td>Australia</td></tr>\n",
+ " <tr><th>620</th><td>ball|blood|skating|song|year 2005</td><td>Rollerball</td></tr>\n",
+ " <tr><th>742</th><td>contagion|cure|infection|panic|virus</td><td>Contagion</td></tr>\n",
+ " <tr><th>833</th><td>anger management|argument|irony|sarcasm|therapist</td><td>Anger Management</td></tr>\n",
+ " <tr><th>847</th><td>casper|friendly ghost|ghost|maine|mansion</td><td>Casper</td></tr>\n",
+ " <tr><th>888</th><td>burlesque|dancer|iowa|small town girl|stage</td><td>Burlesque</td></tr>\n",
+ " <tr><th>890</th><td>lolita|nymphet|older man young girl relationsh...</td><td>Lolita</td></tr>\n",
+ " <tr><th>989</th><td>boy|bully|dream|dream sequence|planet</td><td>The Adventures of Sharkboy and Lavagirl 3-D</td></tr>\n",
+ " <tr><th>1160</th><td>cleopatra|egypt|epic|queen|roman empire</td><td>Cleopatra</td></tr>\n",
+ " <tr><th>1580</th><td>arachnophobia|death|doctor|small town|spider</td><td>Arachnophobia</td></tr>\n",
+ " <tr><th>1767</th><td>cop|corrupt politician|future|senator|time travel</td><td>Timecop</td></tr>\n",
+ " <tr><th>1997</th><td>blindness|epidemic|hospital|pubic hair|quarantine</td><td>Blindness</td></tr>\n",
+ " <tr><th>2175</th><td>drumline|drummer|fish out of water|marching ba...</td><td>Drumline</td></tr>\n",
+ " <tr><th>2186</th><td>boogeyman|childhood|fear|hometown|uncle</td><td>Boogeyman</td></tr>\n",
+ " <tr><th>2318</th><td>flash of genius|genius|intellectual property|p...</td><td>Flash of Genius</td></tr>\n",
+ " <tr><th>2349</th><td>ramanujan</td><td>Ramanujan</td></tr>\n",
+ " <tr><th>2481</th><td>hitman|impersonation|see through dress|topless...</td><td>Hitman</td></tr>\n",
+ " <tr><th>2492</th><td>halloween|masked killer|michael myers|slasher|...</td><td>Halloween</td></tr>\n",
+ " <tr><th>2619</th><td>halloween|masked killer|michael myers|slasher|...</td><td>Halloween</td></tr>\n",
+ " <tr><th>2911</th><td>machete|mexican|mexico|priest|texas</td><td>Machete</td></tr>\n",
+ " <tr><th>2981</th><td>celebrity|journalist|lesbian kiss|strong femal...</td><td>Celebrity</td></tr>\n",
+ " <tr><th>3036</th><td>phone booth|publicist|single set production|sn...</td><td>Phone Booth</td></tr>\n",
+ " <tr><th>3113</th><td>devil|elevator|hit and run|throat slitting|tra...</td><td>Devil</td></tr>\n",
+ " <tr><th>3277</th><td>alien|creature|future|outer space|spaceship</td><td>Alien</td></tr>\n",
+ " <tr><th>3617</th><td>college|face slap|high school|loss of virginit...</td><td>College</td></tr>\n",
+ " <tr><th>3745</th><td>april fool's day|island|mansion|psycho|secret</td><td>April Fool's Day</td></tr>\n",
+ " <tr><th>4109</th><td>freeway|nightmare|police|school|trailer park</td><td>Freeway</td></tr>\n",
+ " <tr><th>4128</th><td>alice in wonderland|mistaking reality for drea...</td><td>Alice in Wonderland</td></tr>\n",
+ " <tr><th>4185</th><td>ape|fear|future|spacecraft|spaceship</td><td>Escape from the Planet of the Apes</td></tr>\n",
+ " <tr><th>4256</th><td>lolita|nymphet|older man young girl relationsh...</td><td>Lolita</td></tr>\n",
+ " <tr><th>4393</th><td>caramel|friendship|police|secret|suitor</td><td>Caramel</td></tr>\n",
+ " <tr><th>4821</th><td>halloween|masked killer|michael myers|slasher|...</td><td>Halloween</td></tr>\n",
+ " <tr><th>4939</th><td>aroused|photography|pornography documentary|po...</td><td>Aroused</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " plot_keywords \\\n", + "0 avatar|future|marine|native|paraplegic \n", + "33 alice in wonderland|mistaking reality for drea... \n", + "196 australia|cattle|darwin|drover|japanese \n", + "620 ball|blood|skating|song|year 2005 \n", + "742 contagion|cure|infection|panic|virus \n", + "833 anger management|argument|irony|sarcasm|therapist \n", + "847 casper|friendly ghost|ghost|maine|mansion \n", + "888 burlesque|dancer|iowa|small town girl|stage \n", + "890 lolita|nymphet|older man young girl relationsh... \n", + "989 boy|bully|dream|dream sequence|planet \n", + "1160 cleopatra|egypt|epic|queen|roman empire \n", + "1580 arachnophobia|death|doctor|small town|spider \n", + "1767 cop|corrupt politician|future|senator|time travel \n", + "1997 blindness|epidemic|hospital|pubic hair|quarantine \n", + "2175 drumline|drummer|fish out of water|marching ba... \n", + "2186 boogeyman|childhood|fear|hometown|uncle \n", + "2318 flash of genius|genius|intellectual property|p... \n", + "2349 ramanujan \n", + "2481 hitman|impersonation|see through dress|topless... \n", + "2492 halloween|masked killer|michael myers|slasher|... \n", + "2619 halloween|masked killer|michael myers|slasher|... \n", + "2911 machete|mexican|mexico|priest|texas \n", + "2981 celebrity|journalist|lesbian kiss|strong femal... \n", + "3036 phone booth|publicist|single set production|sn... \n", + "3113 devil|elevator|hit and run|throat slitting|tra... \n", + "3277 alien|creature|future|outer space|spaceship \n", + "3617 college|face slap|high school|loss of virginit... \n", + "3745 april fool's day|island|mansion|psycho|secret \n", + "4109 freeway|nightmare|police|school|trailer park \n", + "4128 alice in wonderland|mistaking reality for drea... \n", + "4185 ape|fear|future|spacecraft|spaceship \n", + "4256 lolita|nymphet|older man young girl relationsh... \n", + "4393 caramel|friendship|police|secret|suitor \n", + "4821 halloween|masked killer|michael myers|slasher|... 
\n", + "4939 aroused|photography|pornography documentary|po... \n", + "\n", + " movie_title \n", + "0 Avatar  \n", + "33 Alice in Wonderland  \n", + "196 Australia  \n", + "620 Rollerball  \n", + "742 Contagion  \n", + "833 Anger Management  \n", + "847 Casper  \n", + "888 Burlesque  \n", + "890 Lolita  \n", + "989 The Adventures of Sharkboy and Lavagirl 3-D  \n", + "1160 Cleopatra  \n", + "1580 Arachnophobia  \n", + "1767 Timecop  \n", + "1997 Blindness  \n", + "2175 Drumline  \n", + "2186 Boogeyman  \n", + "2318 Flash of Genius  \n", + "2349 Ramanujan  \n", + "2481 Hitman  \n", + "2492 Halloween  \n", + "2619 Halloween  \n", + "2911 Machete  \n", + "2981 Celebrity  \n", + "3036 Phone Booth  \n", + "3113 Devil  \n", + "3277 Alien  \n", + "3617 College  \n", + "3745 April Fool's Day  \n", + "4109 Freeway  \n", + "4128 Alice in Wonderland  \n", + "4185 Escape from the Planet of the Apes  \n", + "4256 Lolita  \n", + "4393 Caramel  \n", + "4821 Halloween  \n", + "4939 Aroused  " + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[is_part][['plot_keywords','movie_title']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Performance tests\n", + "\n", + "* for loop and comparison - 8.304 seconds\n", + "* lambda expression - 25.893 seconds" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 18545451 function calls (18544652 primitive calls) in 8.718 seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 198 0.000 0.000 0.000 0.000 :416(parent)\n", + " 1290 0.001 0.000 0.002 0.000 :997(_handle_fromlist)\n", + " 1 0.498 0.498 8.718 8.718 :7(before)\n", + " 1 0.000 0.000 8.718 8.718 :1()\n", + " 100 0.000 0.000 0.000 0.000 __init__.py:205(iteritems)\n", + " 497 0.000 0.000 0.003 0.000 
_methods.py:42(_any)\n", + " 297 0.000 0.000 0.001 0.000 algorithms.py:1421(_get_take_nd_function)\n", + " 297 0.003 0.000 0.031 0.000 algorithms.py:1548(take_nd)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1569(is_unique)\n", + " 499257 0.441 0.000 1.580 0.000 base.py:1647(_convert_scalar_indexer)\n", + " 2 0.000 0.000 0.000 0.000 base.py:1935(_engine)\n", + " 2 0.000 0.000 0.000 0.000 base.py:1938()\n", + " 297 0.001 0.000 0.001 0.000 base.py:2033(__contains__)\n", + " 4 0.000 0.000 0.000 0.000 base.py:2067(__getitem__)\n", + " 99 0.000 0.000 0.010 0.000 base.py:2179(take)\n", + " 99 0.000 0.000 0.002 0.000 base.py:2445(equals)\n", + " 99 0.002 0.000 0.007 0.000 base.py:255(__new__)\n", + " 8 0.000 0.000 0.000 0.000 base.py:3071(get_loc)\n", + " 499257 1.162 0.000 7.143 0.000 base.py:3090(get_value)\n", + " 499257 0.120 0.000 0.183 0.000 base.py:4117(_maybe_cast_indexer)\n", + " 100 0.000 0.000 0.001 0.000 base.py:473(_simple_new)\n", + " 204 0.000 0.000 0.000 0.000 base.py:4914(_ensure_index)\n", + " 99 0.001 0.000 0.008 0.000 base.py:520(_shallow_copy_with_infer)\n", + " 1685 0.001 0.000 0.003 0.000 base.py:61(is_dtype)\n", + " 99 0.000 0.000 0.000 0.000 base.py:615(is_)\n", + " 100 0.000 0.000 0.000 0.000 base.py:635(_reset_identity)\n", + " 799 0.000 0.000 0.000 0.000 base.py:641(__len__)\n", + " 99 0.000 0.000 0.000 0.000 base.py:662(dtype)\n", + " 301 0.000 0.000 0.000 0.000 base.py:672(values)\n", + " 2 0.000 0.000 0.000 0.000 base.py:677(_values)\n", + " 198 0.000 0.000 0.000 0.000 base.py:711(get_values)\n", + " 2 0.000 0.000 0.000 0.000 base.py:789(_ndarray_values)\n", + " 99 0.000 0.000 0.005 0.000 base.py:893(tolist)\n", + " 99 0.000 0.000 0.000 0.000 base.py:904(_coerce_to_ndarray)\n", + " 99 0.000 0.000 0.005 0.000 base.py:912(__iter__)\n", + " 99 0.000 0.000 0.000 0.000 base.py:920(_get_attributes_dict)\n", + " 99 0.000 0.000 0.000 0.000 base.py:922()\n", + " 297 0.001 0.000 0.008 0.000 cast.py:257(maybe_promote)\n", + " 99 0.001 0.000 0.011 0.000 
common.py:100(is_bool_indexer)\n", + " 99 0.000 0.000 0.001 0.000 common.py:1043(is_datetime64_any_dtype)\n", + " 401 0.000 0.000 0.001 0.000 common.py:122(is_sparse)\n", + " 993 0.002 0.000 0.006 0.000 common.py:1688(is_extension_array_dtype)\n", + " 792 0.000 0.000 0.001 0.000 common.py:1784(_get_dtype)\n", + " 594 0.001 0.000 0.002 0.000 common.py:1835(_get_dtype_type)\n", + " 1 0.000 0.000 0.000 0.000 common.py:195(is_categorical)\n", + " 991 0.000 0.000 0.004 0.000 common.py:227(is_datetimetz)\n", + " 198 0.000 0.000 0.001 0.000 common.py:332(is_datetime64_dtype)\n", + " 1189 0.001 0.000 0.003 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 499554 0.125 0.000 0.169 0.000 common.py:395(_apply_if_callable)\n", + " 198 0.000 0.000 0.001 0.000 common.py:407(is_timedelta64_dtype)\n", + " 495 0.000 0.000 0.001 0.000 common.py:477(is_interval_dtype)\n", + " 199 0.000 0.000 0.000 0.000 common.py:513(is_categorical_dtype)\n", + " 99 0.000 0.000 0.001 0.000 common.py:647(is_datetimelike)\n", + " 396 0.000 0.000 0.001 0.000 common.py:692(is_dtype_equal)\n", + " 99 0.000 0.000 0.000 0.000 common.py:858(is_signed_integer_dtype)\n", + " 99 0.000 0.000 0.001 0.000 common.py:89(is_object_dtype)\n", + " 4 0.000 0.000 0.000 0.000 concat.py:105(_get_sliced_frame_result_type)\n", + " 99 0.000 0.000 0.001 0.000 dtypes.py:401(__new__)\n", + " 99 0.000 0.000 0.001 0.000 dtypes.py:459(construct_from_string)\n", + " 396 0.001 0.000 0.001 0.000 dtypes.py:707(is_dtype)\n", + " 297 0.002 0.000 0.098 0.000 frame.py:2664(__getitem__)\n", + " 198 0.000 0.000 0.001 0.000 frame.py:2690(_getitem_column)\n", + " 99 0.001 0.000 0.094 0.001 frame.py:2707(_getitem_array)\n", + " 4 0.000 0.000 0.000 0.000 frame.py:3093(_box_item_values)\n", + " 4 0.000 0.000 0.000 0.000 frame.py:3100(_box_col_values)\n", + " 99 0.000 0.000 0.000 0.000 frame.py:320(_constructor)\n", + " 99 0.000 0.000 0.001 0.000 frame.py:334(__init__)\n", + " 1 0.000 0.000 0.000 0.000 fromnumeric.py:49(_wrapfunc)\n", + " 1 0.000 
0.000 0.000 0.000 fromnumeric.py:882(argsort)\n", + " 103 0.000 0.000 0.000 0.000 generic.py:124(__init__)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:178(_init_mgr)\n", + " 198 0.000 0.000 0.000 0.000 generic.py:2484(_get_item_cache)\n", + " 4 0.000 0.000 0.000 0.000 generic.py:2498(_set_as_cached)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2577(_clear_item_cache)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:2603(_set_is_copy)\n", + " 99 0.001 0.000 0.071 0.001 generic.py:2783(_take)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:364(_get_axis_number)\n", + " 198 0.000 0.000 0.000 0.000 generic.py:377(_get_axis_name)\n", + " 198 0.000 0.000 0.001 0.000 generic.py:390(_get_axis)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:394(_get_block_manager_axis)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:4345(__finalize__)\n", + " 4 0.000 0.000 0.000 0.000 generic.py:4362(__getattr__)\n", + " 210 0.000 0.000 0.000 0.000 generic.py:4378(__setattr__)\n", + " 99 0.000 0.000 0.002 0.000 generic.py:4423(_protect_consolidate)\n", + " 99 0.000 0.000 0.002 0.000 generic.py:4433(_consolidate_inplace)\n", + " 99 0.000 0.000 0.002 0.000 generic.py:4436(f)\n", + " 504025 0.178 0.000 0.254 0.000 generic.py:7(_check)\n", + " 99 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)\n", + " 99 0.000 0.000 0.010 0.000 indexing.py:2345(check_bool_indexer)\n", + " 99 0.002 0.000 0.004 0.000 indexing.py:2441(maybe_convert_indices)\n", + " 4 0.000 0.000 0.000 0.000 inference.py:415(is_hashable)\n", + " 302 0.001 0.000 0.001 0.000 internals.py:116(__init__)\n", + " 297 0.001 0.000 0.036 0.000 internals.py:1237(take_nd)\n", + " 302 0.000 0.000 0.000 0.000 internals.py:127(_check_ndim)\n", + " 8 0.000 0.000 0.000 0.000 internals.py:166(_consolidate_key)\n", + " 499554 0.069 0.000 0.069 0.000 internals.py:203(internal_values)\n", + " 499257 0.142 0.000 0.290 0.000 internals.py:222(to_dense)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:229(fill_value)\n", + " 104 0.000 0.000 
0.001 0.000 internals.py:2298(__init__)\n", + " 1206 0.000 0.000 0.000 0.000 internals.py:233(mgr_locs)\n", + " 302 0.000 0.000 0.000 0.000 internals.py:237(mgr_locs)\n", + " 301 0.000 0.000 0.003 0.000 internals.py:269(make_block_same_class)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3148(get_block_type)\n", + " 302 0.001 0.000 0.002 0.000 internals.py:3191(make_block)\n", + " 100 0.000 0.000 0.008 0.000 internals.py:3265(__init__)\n", + " 100 0.000 0.000 0.000 0.000 internals.py:3266()\n", + " 401 0.001 0.000 0.002 0.000 internals.py:3307(shape)\n", + " 1203 0.000 0.000 0.001 0.000 internals.py:3309()\n", + " 400 0.000 0.000 0.000 0.000 internals.py:3311(ndim)\n", + " 101 0.002 0.000 0.004 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)\n", + " 108 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 301 0.000 0.000 0.000 0.000 internals.py:348(shape)\n", + " 100 0.001 0.000 0.002 0.000 internals.py:3488(_verify_integrity)\n", + " 401 0.000 0.000 0.000 0.000 internals.py:3490()\n", + " 499867 0.101 0.000 0.101 0.000 internals.py:352(dtype)\n", + " 305 0.000 0.000 0.001 0.000 internals.py:356(ftype)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:372(iget)\n", + " 298 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)\n", + " 101 0.000 0.000 0.001 0.000 internals.py:3784(_consolidate_check)\n", + " 101 0.000 0.000 0.001 0.000 internals.py:3785()\n", + " 99 0.000 0.000 0.001 0.000 internals.py:4085(consolidate)\n", + " 199 0.000 0.000 0.001 0.000 internals.py:4101(_consolidate_inplace)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:4108(get)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:4137(iget)\n", + " 99 0.000 0.000 0.046 0.000 internals.py:4388(reindex_indexer)\n", + " 99 0.000 0.000 0.037 0.000 internals.py:4423()\n", + " 99 0.001 0.000 0.063 0.001 internals.py:4518(take)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:4639(__init__)\n", + " 1498068 0.278 0.000 0.278 0.000 internals.py:4684(_block)\n", + " 499257 0.254 0.000 0.453 
0.000 internals.py:4718(dtype)\n", + " 499554 0.266 0.000 0.429 0.000 internals.py:4745(internal_values)\n", + " 499257 0.357 0.000 0.856 0.000 internals.py:4752(get_values)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:5057(_consolidate)\n", + " 8 0.000 0.000 0.000 0.000 internals.py:5063()\n", + " 3 0.000 0.000 0.001 0.000 internals.py:5074(_merge_blocks)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5088()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5089()\n", + " 3 0.000 0.000 0.000 0.000 internals.py:5101(_extend_blocks)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5127(_vstack)\n", + " 4 0.000 0.000 0.000 0.000 missing.py:112(_isna_new)\n", + " 4 0.000 0.000 0.000 0.000 missing.py:32(isna)\n", + " 99 0.000 0.000 0.000 0.000 missing.py:376(array_equivalent)\n", + " 4 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)\n", + " 499257 0.530 0.000 2.293 0.000 numeric.py:179(_convert_scalar_indexer)\n", + " 100 0.000 0.000 0.002 0.000 numeric.py:35(__new__)\n", + " 693 0.000 0.000 0.020 0.000 numeric.py:433(asarray)\n", + " 101 0.000 0.000 0.001 0.000 numeric.py:504(asanyarray)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:630(require)\n", + " 99 0.000 0.000 0.008 0.000 numeric.py:64(_shallow_copy)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:701()\n", + " 1 0.000 0.000 0.000 0.000 range.py:169(_data)\n", + " 1 0.000 0.000 0.000 0.000 range.py:173(_int64index)\n", + " 99 0.000 0.000 0.009 0.000 range.py:260(_shallow_copy)\n", + " 499461 0.280 0.000 0.406 0.000 range.py:481(__len__)\n", + " 4 0.000 0.000 0.000 0.000 series.py:166(__init__)\n", + " 4 0.000 0.000 0.000 0.000 series.py:365(_set_axis)\n", + " 4 0.000 0.000 0.000 0.000 series.py:391(_set_subtyp)\n", + " 4 0.000 0.000 0.000 0.000 series.py:401(name)\n", + " 4 0.000 0.000 0.000 0.000 series.py:405(name)\n", + " 499257 0.218 0.000 0.671 0.000 series.py:412(dtype)\n", + " 499554 0.181 0.000 0.610 0.000 series.py:465(_values)\n", + " 499257 0.175 0.000 1.031 0.000 series.py:476(get_values)\n", + " 
499257 0.703 0.000 8.076 0.000 series.py:764(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 shape_base.py:182(vstack)\n", + " 1 0.000 0.000 0.000 0.000 shape_base.py:234()\n", + " 2 0.000 0.000 0.000 0.000 shape_base.py:63(atleast_2d)\n", + " 100 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 499554 0.044 0.000 0.044 0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 8.718 8.718 {built-in method builtins.exec}\n", + " 1504621 0.482 0.000 1.091 0.000 {built-in method builtins.getattr}\n", + " 2581 0.001 0.000 0.001 0.000 {built-in method builtins.hasattr}\n", + " 301 0.000 0.000 0.000 0.000 {built-in method builtins.hash}\n", + " 1013118 0.337 0.000 0.590 0.000 {built-in method builtins.isinstance}\n", + " 2089 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}\n", + " 199 0.000 0.000 0.000 0.000 {built-in method builtins.iter}\n", + "503778/502979 0.200 0.000 0.606 0.000 {built-in method builtins.len}\n", + " 499461 0.126 0.000 0.126 0.000 {built-in method builtins.max}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.sorted}\n", + " 100 0.000 0.000 0.000 0.000 {built-in method builtins.sum}\n", + " 305 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.arange}\n", + " 500052 0.145 0.000 0.145 0.000 {built-in method numpy.core.multiarray.array}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.concatenate}\n", + " 499 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty}\n", + " 297 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_platform_int}\n", + " 998811 0.122 0.000 0.122 0.000 {built-in method pandas._libs.lib.is_float}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 499265 0.061 0.000 0.061 0.000 {built-in method pandas._libs.lib.is_scalar}\n", + " 4 0.000 0.000 0.000 0.000 {built-in method 
pandas._libs.missing.checknull}\n", + " 497 0.000 0.000 0.003 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 499262 0.041 0.000 0.041 0.000 {method 'append' of 'list' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'argsort' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'clear' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 202 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}\n", + " 305 0.001 0.000 0.001 0.000 {method 'format' of 'str' objects}\n", + " 792 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}\n", + " 8 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}\n", + " 499257 0.452 0.000 0.452 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}\n", + " 199 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}\n", + " 99 0.001 0.000 0.001 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 497 0.003 0.000 0.003 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 198 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'search' of '_sre.SRE_Pattern' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'take' of 'numpy.ndarray' objects}\n", + " 99 0.002 0.000 0.002 0.000 {method 'tolist' of 'numpy.ndarray' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}\n", + " 499558 0.148 0.000 0.148 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 99 0.003 0.000 0.003 0.000 {pandas._libs.algos.take_2d_axis1_float64_float64}\n", + " 99 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}\n", + " 99 0.008 0.000 0.008 0.000 {pandas._libs.algos.take_2d_axis1_object_object}\n", + " 998712 0.411 0.000 1.443 0.000 
{pandas._libs.lib.values_from_object}\n", + "\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 53174395 function calls (51174298 primitive calls) in 26.802 seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 198 0.000 0.000 0.000 0.000 :416(parent)\n", + " 2574 0.002 0.000 0.004 0.000 :997(_handle_fromlist)\n", + " 1 0.065 0.065 26.802 26.802 :20(after)\n", + " 499257 0.860 0.000 21.422 0.000 :22()\n", + " 1 0.000 0.000 26.802 26.802 :1()\n", + " 99 0.000 0.000 0.000 0.000 __init__.py:205(iteritems)\n", + " 99 0.000 0.000 0.009 0.000 _decorators.py:136(wrapper)\n", + " 594 0.000 0.000 0.004 0.000 _methods.py:42(_any)\n", + " 99 0.000 0.000 0.001 0.000 _methods.py:45(_all)\n", + " 990 0.001 0.000 0.001 0.000 _weakrefset.py:70(__contains__)\n", + " 891 0.002 0.000 0.003 0.000 abc.py:180(__instancecheck__)\n", + " 396 0.001 0.000 0.001 0.000 algorithms.py:1421(_get_take_nd_function)\n", + " 396 0.004 0.000 0.033 0.000 algorithms.py:1548(take_nd)\n", + " 99 0.000 0.000 0.000 0.000 apply.py:101(agg_axis)\n", + " 99 0.001 0.000 26.656 0.269 apply.py:105(get_result)\n", + " 99 0.000 0.000 0.001 0.000 apply.py:14(frame_apply)\n", + " 99 0.003 0.000 26.655 0.269 apply.py:219(apply_standard)\n", + " 99 0.000 0.000 0.000 0.000 apply.py:34(__init__)\n", + " 99 0.000 0.000 0.000 0.000 apply.py:85(columns)\n", + " 99 0.000 0.000 0.192 0.002 apply.py:93(values)\n", + " 99 0.000 0.000 0.019 0.000 apply.py:97(dtypes)\n", + " 99 0.000 0.000 0.000 0.000 base.py:1442(_has_complex_internals)\n", + " 998514 0.450 0.000 1.705 0.000 base.py:1590(is_object)\n", + " 998514 0.935 0.000 2.817 0.000 base.py:1647(_convert_scalar_indexer)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1976(is_all_dates)\n", + " 998613 0.649 0.000 0.784 0.000 base.py:2033(__contains__)\n", + " 998514 0.670 0.000 3.158 0.000 base.py:2101(_can_hold_identifiers_and_holds_name)\n", + " 99 0.000 
0.000 0.010 0.000 base.py:2179(take)\n", + " 99 0.000 0.000 0.002 0.000 base.py:2445(equals)\n", + " 99 0.002 0.000 0.007 0.000 base.py:255(__new__)\n", + " 998514 2.477 0.000 13.285 0.000 base.py:3090(get_value)\n", + " 99 0.000 0.000 0.001 0.000 base.py:473(_simple_new)\n", + " 999306 0.296 0.000 0.421 0.000 base.py:4914(_ensure_index)\n", + " 99 0.000 0.000 0.008 0.000 base.py:520(_shallow_copy_with_infer)\n", + " 9702 0.005 0.000 0.011 0.000 base.py:61(is_dtype)\n", + " 99 0.000 0.000 0.000 0.000 base.py:615(is_)\n", + " 99 0.000 0.000 0.000 0.000 base.py:635(_reset_identity)\n", + " 1999602 0.545 0.000 0.763 0.000 base.py:641(__len__)\n", + " 100 0.000 0.000 0.000 0.000 base.py:662(dtype)\n", + " 496 0.000 0.000 0.001 0.000 base.py:672(values)\n", + " 198 0.000 0.000 0.000 0.000 base.py:711(get_values)\n", + " 99 0.000 0.000 0.000 0.000 base.py:904(_coerce_to_ndarray)\n", + " 99 0.000 0.000 0.000 0.000 base.py:920(_get_attributes_dict)\n", + " 99 0.000 0.000 0.000 0.000 base.py:922()\n", + " 99 0.001 0.000 0.006 0.000 cast.py:1093(find_common_type)\n", + " 198 0.000 0.000 0.000 0.000 cast.py:1118()\n", + " 396 0.000 0.000 0.000 0.000 cast.py:1121()\n", + " 198 0.000 0.000 0.000 0.000 cast.py:1126()\n", + " 198 0.000 0.000 0.000 0.000 cast.py:1128()\n", + " 396 0.000 0.000 0.001 0.000 cast.py:1133()\n", + " 297 0.000 0.000 0.001 0.000 cast.py:1232(construct_1d_ndarray_preserving_na)\n", + " 297 0.001 0.000 0.008 0.000 cast.py:257(maybe_promote)\n", + " 495 0.001 0.000 0.001 0.000 cast.py:853(maybe_castable)\n", + " 99 0.001 0.000 0.002 0.000 cast.py:867(maybe_infer_to_datetimelike)\n", + " 297 0.002 0.000 0.008 0.000 cast.py:971(maybe_cast_to_datetime)\n", + " 99 0.000 0.000 0.001 0.000 common.py:100(is_bool_indexer)\n", + " 99 0.000 0.000 0.001 0.000 common.py:1043(is_datetime64_any_dtype)\n", + " 198 0.000 0.000 0.001 0.000 common.py:1170(is_datetime_or_timedelta_dtype)\n", + " 4257 0.001 0.000 0.010 0.000 common.py:122(is_sparse)\n", + " 99 0.000 0.000 0.001 
0.000 common.py:1405(needs_i8_conversion)\n", + " 198 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)\n", + " 396 0.000 0.000 0.001 0.000 common.py:1578(is_bool_dtype)\n", + " 3267 0.003 0.000 0.028 0.000 common.py:1629(is_extension_type)\n", + " 1386 0.002 0.000 0.008 0.000 common.py:1688(is_extension_array_dtype)\n", + " 1089 0.001 0.000 0.001 0.000 common.py:1784(_get_dtype)\n", + " 1001286 0.431 0.000 0.562 0.000 common.py:1835(_get_dtype_type)\n", + " 3564 0.002 0.000 0.011 0.000 common.py:195(is_categorical)\n", + " 396 0.002 0.000 0.002 0.000 common.py:1965(pandas_dtype)\n", + " 4752 0.002 0.000 0.015 0.000 common.py:227(is_datetimetz)\n", + " 594 0.001 0.000 0.002 0.000 common.py:332(is_datetime64_dtype)\n", + " 5148 0.002 0.000 0.008 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 998613 0.237 0.000 0.333 0.000 common.py:395(_apply_if_callable)\n", + " 396 0.000 0.000 0.001 0.000 common.py:407(is_timedelta64_dtype)\n", + " 99 0.000 0.000 0.000 0.000 common.py:444(is_period_dtype)\n", + " 693 0.000 0.000 0.002 0.000 common.py:477(is_interval_dtype)\n", + " 3960 0.002 0.000 0.006 0.000 common.py:513(is_categorical_dtype)\n", + " 99 0.000 0.000 0.000 0.000 common.py:546(is_string_dtype)\n", + " 495 0.000 0.000 0.001 0.000 common.py:692(is_dtype_equal)\n", + " 99 0.000 0.000 0.000 0.000 common.py:811(is_integer_dtype)\n", + " 99 0.000 0.000 0.000 0.000 common.py:858(is_signed_integer_dtype)\n", + " 999306 0.563 0.000 1.258 0.000 common.py:89(is_object_dtype)\n", + " 99 0.000 0.000 0.000 0.000 common.py:995(is_int_or_datetime_dtype)\n", + " 198 0.000 0.000 0.000 0.000 dtypes.py:266(construct_from_string)\n", + " 99 0.000 0.000 0.001 0.000 dtypes.py:401(__new__)\n", + " 99 0.000 0.000 0.001 0.000 dtypes.py:459(construct_from_string)\n", + " 99 0.000 0.000 0.000 0.000 dtypes.py:584(is_dtype)\n", + " 594 0.001 0.000 0.002 0.000 dtypes.py:707(is_dtype)\n", + " 99 0.001 0.000 0.079 0.001 frame.py:2664(__getitem__)\n", + " 99 0.001 0.000 0.077 0.001 
frame.py:2707(_getitem_array)\n", + " 99 0.000 0.000 0.000 0.000 frame.py:320(_constructor)\n", + " 99 0.000 0.000 0.001 0.000 frame.py:334(__init__)\n", + " 99 0.000 0.000 0.000 0.000 frame.py:555(shape)\n", + " 99 0.001 0.000 26.658 0.269 frame.py:5837(apply)\n", + " 99 0.000 0.000 0.000 0.000 frame.py:7047(_get_agg_axis)\n", + " 99 0.000 0.000 0.000 0.000 function.py:38(__call__)\n", + " 693 0.001 0.000 0.001 0.000 generic.py:124(__init__)\n", + " 99 0.000 0.000 0.001 0.000 generic.py:1490(__hash__)\n", + " 198 0.000 0.000 0.002 0.000 generic.py:164(_validate_dtype)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:178(_init_mgr)\n", + " 99 0.000 0.000 0.000 0.000 generic.py:2603(_set_is_copy)\n", + " 99 0.001 0.000 0.068 0.001 generic.py:2783(_take)\n", + " 297 0.001 0.000 0.001 0.000 generic.py:364(_get_axis_number)\n", + " 297 0.001 0.000 0.001 0.000 generic.py:377(_get_axis_name)\n", + " 297 0.000 0.000 0.001 0.000 generic.py:390(_get_axis)\n", + " 99 0.000 0.000 0.001 0.000 generic.py:394(_get_block_manager_axis)\n", + " 297 0.000 0.000 0.001 0.000 generic.py:4345(__finalize__)\n", + " 999306 1.556 0.000 20.563 0.000 generic.py:4362(__getattr__)\n", + " 891 0.003 0.000 0.005 0.000 generic.py:4378(__setattr__)\n", + " 998613 0.367 0.000 0.633 0.000 generic.py:438(_info_axis)\n", + " 198 0.000 0.000 0.001 0.000 generic.py:4423(_protect_consolidate)\n", + " 198 0.000 0.000 0.001 0.000 generic.py:4433(_consolidate_inplace)\n", + " 198 0.000 0.000 0.001 0.000 generic.py:4436(f)\n", + " 99 0.000 0.000 0.192 0.002 generic.py:4563(values)\n", + " 99 0.001 0.000 0.019 0.000 generic.py:4765(dtypes)\n", + " 99 0.001 0.000 0.009 0.000 generic.py:4890(astype)\n", + " 1020096 0.331 0.000 0.509 0.000 generic.py:7(_check)\n", + " 99 0.000 0.000 0.010 0.000 generic.py:9675(logical_func)\n", + " 99 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)\n", + " 99 0.000 0.000 0.003 0.000 indexing.py:2345(check_bool_indexer)\n", + " 99 0.002 0.000 0.004 0.000 
indexing.py:2441(maybe_convert_indices)\n", + " 198 0.000 0.000 0.000 0.000 inference.py:119(is_iterator)\n", + " 891 0.001 0.000 0.005 0.000 inference.py:251(is_list_like)\n", + " 99 0.000 0.000 0.000 0.000 inference.py:364(is_dict_like)\n", + " 499356 0.156 0.000 0.266 0.000 inference.py:415(is_hashable)\n", + " 99 0.000 0.000 0.000 0.000 inspect.py:73(isclass)\n", + " 891 0.002 0.000 0.005 0.000 internals.py:116(__init__)\n", + " 297 0.001 0.000 0.034 0.000 internals.py:1237(take_nd)\n", + " 891 0.001 0.000 0.001 0.000 internals.py:127(_check_ndim)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:184(is_categorical_astype)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:199(external_values)\n", + " 998613 0.147 0.000 0.147 0.000 internals.py:203(internal_values)\n", + " 297 0.000 0.000 0.109 0.000 internals.py:213(get_values)\n", + " 998613 0.327 0.000 0.663 0.000 internals.py:222(to_dense)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:229(fill_value)\n", + " 495 0.001 0.000 0.004 0.000 internals.py:2298(__init__)\n", + " 2178 0.001 0.000 0.001 0.000 internals.py:233(mgr_locs)\n", + " 891 0.001 0.000 0.001 0.000 internals.py:237(mgr_locs)\n", + " 396 0.000 0.000 0.003 0.000 internals.py:269(make_block_same_class)\n", + " 495 0.002 0.000 0.009 0.000 internals.py:3148(get_block_type)\n", + " 891 0.002 0.000 0.017 0.000 internals.py:3191(make_block)\n", + " 99 0.001 0.000 0.009 0.000 internals.py:3265(__init__)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:3266()\n", + " 594 0.001 0.000 0.003 0.000 internals.py:3307(shape)\n", + " 1782 0.001 0.000 0.002 0.000 internals.py:3309()\n", + " 495 0.000 0.000 0.000 0.000 internals.py:3311(ndim)\n", + " 499257 0.527 0.000 1.341 0.000 internals.py:3315(set_axis)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:3351(_is_single_block)\n", + " 99 0.002 0.000 0.005 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 99 0.000 0.000 0.005 0.000 
internals.py:3404(get_dtypes)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:3405()\n", + " 198 0.000 0.000 0.000 0.000 internals.py:3473(__len__)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:348(shape)\n", + " 99 0.001 0.000 0.002 0.000 internals.py:3488(_verify_integrity)\n", + " 396 0.000 0.000 0.000 0.000 internals.py:3490()\n", + " 99 0.001 0.000 0.005 0.000 internals.py:3500(apply)\n", + " 1000098 0.218 0.000 0.218 0.000 internals.py:352(dtype)\n", + " 297 0.000 0.000 0.001 0.000 internals.py:356(ftype)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:3561()\n", + " 99 0.000 0.000 0.005 0.000 internals.py:3713(astype)\n", + " 495 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)\n", + " 99 0.000 0.000 0.001 0.000 internals.py:3784(_consolidate_check)\n", + " 99 0.000 0.000 0.001 0.000 internals.py:3785()\n", + " 99 0.000 0.000 0.000 0.000 internals.py:3789(is_mixed_type)\n", + " 99 0.000 0.000 0.191 0.002 internals.py:3922(as_array)\n", + " 99 0.056 0.001 0.190 0.002 internals.py:3953(_interleave)\n", + " 198 0.000 0.000 0.000 0.000 internals.py:4085(consolidate)\n", + " 297 0.000 0.000 0.000 0.000 internals.py:4101(_consolidate_inplace)\n", + " 99 0.001 0.000 0.044 0.000 internals.py:4388(reindex_indexer)\n", + " 99 0.000 0.000 0.034 0.000 internals.py:4423()\n", + " 99 0.001 0.000 0.062 0.001 internals.py:4518(take)\n", + " 594 0.002 0.000 0.017 0.000 internals.py:4639(__init__)\n", + " 3495591 0.682 0.000 0.682 0.000 internals.py:4684(_block)\n", + " 99 0.000 0.000 0.000 0.000 internals.py:4709(index)\n", + " 998811 0.529 0.000 0.956 0.000 internals.py:4718(dtype)\n", + " 297 0.000 0.000 0.001 0.000 internals.py:4742(external_values)\n", + " 998613 0.566 0.000 0.914 0.000 internals.py:4745(internal_values)\n", + " 998613 0.782 0.000 1.901 0.000 internals.py:4752(get_values)\n", + " 198 0.000 0.000 0.000 0.000 internals.py:4774(_consolidate_inplace)\n", + " 99 0.000 0.000 0.007 0.000 internals.py:5044(_interleaved_dtype)\n", + " 99 0.000 0.000 
0.000 0.000 internals.py:5048()\n", + " 99 0.000 0.000 0.000 0.000 internals.py:5101(_extend_blocks)\n", + " 99 0.000 0.000 0.003 0.000 internals.py:573(astype)\n", + " 99 0.001 0.000 0.003 0.000 internals.py:577(_astype)\n", + " 99 0.000 0.000 0.001 0.000 internals.py:774(copy)\n", + " 99 0.000 0.000 0.004 0.000 missing.py:112(_isna_new)\n", + " 99 0.001 0.000 0.003 0.000 missing.py:189(_isna_ndarraylike)\n", + " 99 0.000 0.000 0.004 0.000 missing.py:32(isna)\n", + " 99 0.000 0.000 0.000 0.000 missing.py:376(array_equivalent)\n", + " 99 0.000 0.000 0.000 0.000 nanops.py:179(_get_fill_value)\n", + " 99 0.001 0.000 0.006 0.000 nanops.py:202(_get_values)\n", + " 99 0.000 0.000 0.000 0.000 nanops.py:256(_na_ok_dtype)\n", + " 99 0.000 0.000 0.000 0.000 nanops.py:260(_view_if_needed)\n", + " 99 0.000 0.000 0.007 0.000 nanops.py:318(nanany)\n", + " 99 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)\n", + " 396 0.001 0.000 0.003 0.000 numeric.py:2491(seterr)\n", + " 396 0.001 0.000 0.001 0.000 numeric.py:2592(geterr)\n", + " 198 0.000 0.000 0.000 0.000 numeric.py:2887(__init__)\n", + " 198 0.000 0.000 0.002 0.000 numeric.py:2891(__enter__)\n", + " 198 0.000 0.000 0.001 0.000 numeric.py:2896(__exit__)\n", + " 99 0.000 0.000 0.002 0.000 numeric.py:35(__new__)\n", + " 693 0.000 0.000 0.003 0.000 numeric.py:433(asarray)\n", + " 99 0.000 0.000 0.001 0.000 numeric.py:504(asanyarray)\n", + " 99 0.000 0.000 0.008 0.000 numeric.py:64(_shallow_copy)\n", + " 99 0.000 0.000 0.000 0.000 numerictypes.py:1001()\n", + " 99 0.000 0.000 0.000 0.000 numerictypes.py:1002()\n", + " 198 0.002 0.000 0.003 0.000 numerictypes.py:927(_can_coerce_all)\n", + " 1881 0.001 0.000 0.001 0.000 numerictypes.py:936()\n", + " 99 0.000 0.000 0.003 0.000 numerictypes.py:950(find_common_type)\n", + " 99 0.000 0.000 0.009 0.000 range.py:260(_shallow_copy)\n", + " 198 0.000 0.000 0.001 0.000 range.py:315(equals)\n", + " 1287 0.001 0.000 0.002 0.000 range.py:481(__len__)\n", + " 594 0.006 0.000 0.061 0.000 
series.py:166(__init__)\n", + " 99 0.002 0.000 0.048 0.000 series.py:3069(apply)\n", + " 99 0.001 0.000 0.010 0.000 series.py:3203(_reduce)\n", + " 198 0.000 0.000 0.000 0.000 series.py:349(_constructor)\n", + " 499851 0.796 0.000 2.627 0.000 series.py:365(_set_axis)\n", + " 499851 0.255 0.000 0.255 0.000 series.py:391(_set_subtyp)\n", + " 792 0.001 0.000 0.002 0.000 series.py:401(name)\n", + " 495 0.003 0.000 0.022 0.000 series.py:4019(_sanitize_array)\n", + " 495 0.002 0.000 0.016 0.000 series.py:4036(_try_cast)\n", + " 500049 0.378 0.000 0.645 0.000 series.py:405(name)\n", + " 998811 0.448 0.000 1.404 0.000 series.py:412(dtype)\n", + " 297 0.000 0.000 0.001 0.000 series.py:432(values)\n", + " 998613 0.412 0.000 1.326 0.000 series.py:465(_values)\n", + " 998613 0.374 0.000 2.275 0.000 series.py:476(get_values)\n", + " 198 0.000 0.000 0.001 0.000 series.py:562(__len__)\n", + " 99 0.000 0.000 0.001 0.000 series.py:637(__array__)\n", + " 998514 1.469 0.000 15.215 0.000 series.py:764(__getitem__)\n", + " 99 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 396 0.000 0.000 0.001 0.000 {built-in method builtins.all}\n", + " 198 0.000 0.000 0.001 0.000 {built-in method builtins.any}\n", + " 998613 0.096 0.000 0.096 0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 26.802 26.802 {built-in method builtins.exec}\n", + " 4026825 1.271 0.000 2.597 0.000 {built-in method builtins.getattr}\n", + " 4950 0.002 0.000 0.002 0.000 {built-in method builtins.hasattr}\n", + " 1497969 0.245 0.000 0.246 0.000 {built-in method builtins.hash}\n", + " 4047912 0.987 0.000 1.499 0.000 {built-in method builtins.isinstance}\n", + " 1006434 0.136 0.000 0.136 0.000 {built-in method builtins.issubclass}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method builtins.iter}\n", + "4009005/2009007 0.842 0.000 1.390 0.000 {built-in method builtins.len}\n", + " 1287 0.001 0.000 0.001 0.000 {built-in method builtins.max}\n", + " 99 0.000 0.000 0.000 0.000 
{built-in method builtins.sum}\n", + " 297 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.arange}\n", + "1000098/999999 0.284 0.000 0.285 0.000 {built-in method numpy.core.multiarray.array}\n", + " 792 0.017 0.000 0.017 0.000 {built-in method numpy.core.multiarray.empty}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.putmask}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}\n", + " 792 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.geterrobj}\n", + " 396 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.seterrobj}\n", + " 396 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 100 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_platform_int}\n", + " 99 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.infer_datetimelike_array}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_datetime_array}\n", + " 998811 0.145 0.000 0.145 0.000 {built-in method pandas._libs.lib.is_float}\n", + " 297 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 998613 0.128 0.000 0.128 0.000 {built-in method pandas._libs.lib.is_scalar}\n", + " 99 0.000 0.000 0.002 0.000 {method 'all' of 'numpy.ndarray' objects}\n", + " 594 0.000 0.000 0.004 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}\n", + " 297 0.108 0.000 0.108 0.000 {method 'astype' of 'numpy.ndarray' objects}\n", + " 198 0.000 0.000 0.000 0.000 {method 'copy' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 198 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}\n", + " 396 0.001 0.000 0.001 0.000 {method 'format' of 'str' objects}\n", + " 990 0.001 0.000 0.001 0.000 {method 'get' of 'dict' 
objects}\n", + " 998514 1.023 0.000 1.023 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}\n", + " 198 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}\n", + " 99 0.002 0.000 0.002 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 297 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}\n", + " 693 0.005 0.000 0.005 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 198 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'search' of '_sre.SRE_Pattern' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'take' of 'numpy.ndarray' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'transpose' of 'numpy.ndarray' objects}\n", + " 99 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}\n", + " 999109 0.336 0.000 0.336 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 99 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_object_object}\n", + " 99 0.003 0.000 0.003 0.000 {pandas._libs.algos.take_2d_axis1_float64_float64}\n", + " 99 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_2d_axis1_int64_int64}\n", + " 99 0.006 0.000 0.006 0.000 {pandas._libs.algos.take_2d_axis1_object_object}\n", + " 99 0.004 0.000 0.028 0.000 {pandas._libs.lib.map_infer}\n", + " 1997325 0.915 0.000 3.190 0.000 {pandas._libs.lib.values_from_object}\n", + " 99 1.553 0.016 26.354 0.266 {pandas._libs.reduction.reduce}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "import cProfile\n", + "import pandas as pd\n", + "movies = pd.read_csv('./csv/movie_metadata.csv')\n", + "movies['rat'] = movies.content_rating.astype(str)\n", + "\n", + "\n", + "def before(movies):\n", + " for i in range(1, 100):\n", + " is_part = []\n", + " title_list = movies['movie_title']\n", + " genre_list = movies['content_rating']\n", + " for i, genre in enumerate(genre_list):\n", + " if str(genre) in title_list[i]:\n", + " 
is_part.append(True)\n", + " else:\n", + " is_part.append(False)\n", + " movies[is_part]\n", + "\n", + "\n", + "def after(movies):\n", + " for i in range(1, 100):\n", + " movies[movies.apply(lambda x: x.rat in x.movie_title, axis=1)]\n", + "\n", + "\n", + "cProfile.run(\"before(movies)\")\n", + "cProfile.run(\"after(movies)\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/notebooks/Pandas search in column, every column and regex.ipynb b/notebooks/Pandas search in column, every column and regex.ipynb new file mode 100644 index 0000000..c64e94f --- /dev/null +++ b/notebooks/Pandas search in column, every column and regex.ipynb @@ -0,0 +1,1234 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas search in column, every column and regex of a dataframe" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Example PDFs\n", + "\n", + "* Food Calories List\n", + "\n", + "http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## With tabula-py\n", + "\n", + "#### Installation\n", + "\n", + "https://pypi.org/project/tabula-py/\n", + "\n", + "`pip install pandas`\n", + "`pip install tabula-py`\n", + "\n", + "#### tabula-py docs\n", + "\n", + "https://www.pydoc.io/pypi/tabula-py-0.9.0/autoapi/wrapper/index.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Read tabular data from PDF" + ] + }, 
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from tabula import read_pdf\n", + "from tabulate import tabulate" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitCalories per pieceCarbs (grams)Water Content
0Apple (1 average)44 calories10.585 %
1Apple cooking35 calories988 %
2Apricot30 calories6.785 %
3Avocado150 calories260 %
4Banana107 calories2675 %
5Blackberries each1 calorie0.285 %
6Blackcurrant each1.1 calorie0.2577 %
7Blueberries (new) 100g49 Cals ( 100g )15 g81 %
8Cherry each2.4 calories0.683 %
9Clementine24 cals566 %
10Currants5 calories1.416 %
11Damson28 calories7.270 %
12One average date 5g5 cals1.214 %
13Dates with inverted sugar 100g250 calories6312 %
14Figs10 calories2.424 %
15Gooseberries2.6 calories0.6580 %
16Grapes 100g Seedless50 cals1582 %
17one average Grape 6g3 calories0.982 %
18Grapefruit whole100 calories2365 %
19Guava24 calories4.485 %
20Kiwi34 calories875 %
21Lemon20 calories3.485 %
22Lychees3 calories0.780 %
23Mango40 calories9.580 %
24Melon Honeydew (130g)36 calories990 %
25Melon Canteloupe (130g)25 cals693 %
26Nectarines42 calories980 %
27Olives6.8 caloriestrace63 %
\n", + "
" + ], + "text/plain": [ + " Fruit Calories per piece Carbs (grams) \\\n", + "0 Apple (1 average) 44 calories 10.5 \n", + "1 Apple cooking 35 calories 9 \n", + "2 Apricot 30 calories 6.7 \n", + "3 Avocado 150 calories 2 \n", + "4 Banana 107 calories 26 \n", + "5 Blackberries each 1 calorie 0.2 \n", + "6 Blackcurrant each 1.1 calorie 0.25 \n", + "7 Blueberries (new) 100g 49 Cals ( 100g ) 15 g \n", + "8 Cherry each 2.4 calories 0.6 \n", + "9 Clementine 24 cals 5 \n", + "10 Currants 5 calories 1.4 \n", + "11 Damson 28 calories 7.2 \n", + "12 One average date 5g 5 cals 1.2 \n", + "13 Dates with inverted sugar 100g 250 calories 63 \n", + "14 Figs 10 calories 2.4 \n", + "15 Gooseberries 2.6 calories 0.65 \n", + "16 Grapes 100g Seedless 50 cals 15 \n", + "17 one average Grape 6g 3 calories 0.9 \n", + "18 Grapefruit whole 100 calories 23 \n", + "19 Guava 24 calories 4.4 \n", + "20 Kiwi 34 calories 8 \n", + "21 Lemon 20 calories 3.4 \n", + "22 Lychees 3 calories 0.7 \n", + "23 Mango 40 calories 9.5 \n", + "24 Melon Honeydew (130g) 36 calories 9 \n", + "25 Melon Canteloupe (130g) 25 cals 6 \n", + "26 Nectarines 42 calories 9 \n", + "27 Olives 6.8 calories trace \n", + "\n", + " Water Content \n", + "0 85 % \n", + "1 88 % \n", + "2 85 % \n", + "3 60 % \n", + "4 75 % \n", + "5 85 % \n", + "6 77 % \n", + "7 81 % \n", + "8 83 % \n", + "9 66 % \n", + "10 16 % \n", + "11 70 % \n", + "12 14 % \n", + "13 12 % \n", + "14 24 % \n", + "15 80 % \n", + "16 82 % \n", + "17 82 % \n", + "18 65 % \n", + "19 85 % \n", + "20 75 % \n", + "21 85 % \n", + "22 80 % \n", + "23 80 % \n", + "24 90 % \n", + "25 93 % \n", + "26 80 % \n", + "27 63 % " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=8)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataframe Search in a single column for a string" + ] + 
}, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitCalories per pieceCarbs (grams)Water Content
0Apple (1 average)44 calories10.585 %
7Blueberries (new) 100g49 Cals ( 100g )15 g81 %
13Dates with inverted sugar 100g250 calories6312 %
16Grapes 100g Seedless50 cals1582 %
24Melon Honeydew (130g)36 calories990 %
25Melon Canteloupe (130g)25 cals693 %
\n", + "
" + ], + "text/plain": [ + " Fruit Calories per piece Carbs (grams) \\\n", + "0 Apple (1 average) 44 calories 10.5 \n", + "7 Blueberries (new) 100g 49 Cals ( 100g ) 15 g \n", + "13 Dates with inverted sugar 100g 250 calories 63 \n", + "16 Grapes 100g Seedless 50 cals 15 \n", + "24 Melon Honeydew (130g) 36 calories 9 \n", + "25 Melon Canteloupe (130g) 25 cals 6 \n", + "\n", + " Water Content \n", + "0 85 % \n", + "7 81 % \n", + "13 12 % \n", + "16 82 % \n", + "24 90 % \n", + "25 93 % " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('1')\n", + "df[df['Fruit'].str.contains(\"1\")]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitWater Content
24Melon Honeydew (130g)90 %
25Melon Canteloupe (130g)93 %
\n", + "
" + ], + "text/plain": [ + " Fruit Water Content\n", + "24 Melon Honeydew (130g) 90 %\n", + "25 Melon Canteloupe (130g) 93 %" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['Fruit'].str.contains(\"Melon\")][['Fruit', 'Water Content']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataframe Search in every column for a string" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitCalories per pieceCarbs (grams)Water Content
0Apple (1 average)44 calories10.585 %
3Avocado150 calories260 %
4Banana107 calories2675 %
5Blackberries each1 calorie0.285 %
6Blackcurrant each1.1 calorie0.2577 %
7Blueberries (new) 100g49 Cals ( 100g )15 g81 %
10Currants5 calories1.416 %
12One average date 5g5 cals1.214 %
13Dates with inverted sugar 100g250 calories6312 %
14Figs10 calories2.424 %
16Grapes 100g Seedless50 cals1582 %
18Grapefruit whole100 calories2365 %
24Melon Honeydew (130g)36 calories990 %
25Melon Canteloupe (130g)25 cals693 %
\n", + "
" + ], + "text/plain": [ + " Fruit Calories per piece Carbs (grams) \\\n", + "0 Apple (1 average) 44 calories 10.5 \n", + "3 Avocado 150 calories 2 \n", + "4 Banana 107 calories 26 \n", + "5 Blackberries each 1 calorie 0.2 \n", + "6 Blackcurrant each 1.1 calorie 0.25 \n", + "7 Blueberries (new) 100g 49 Cals ( 100g ) 15 g \n", + "10 Currants 5 calories 1.4 \n", + "12 One average date 5g 5 cals 1.2 \n", + "13 Dates with inverted sugar 100g 250 calories 63 \n", + "14 Figs 10 calories 2.4 \n", + "16 Grapes 100g Seedless 50 cals 15 \n", + "18 Grapefruit whole 100 calories 23 \n", + "24 Melon Honeydew (130g) 36 calories 9 \n", + "25 Melon Canteloupe (130g) 25 cals 6 \n", + "\n", + " Water Content \n", + "0 85 % \n", + "3 60 % \n", + "4 75 % \n", + "5 85 % \n", + "6 77 % \n", + "7 81 % \n", + "10 16 % \n", + "12 14 % \n", + "13 12 % \n", + "14 24 % \n", + "16 82 % \n", + "18 65 % \n", + "24 90 % \n", + "25 93 % " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('1')\n", + "df2= df[df.apply(lambda row: row.astype(str).str.contains('1').any(), axis=1)]\n", + "df2" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Melon\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitWater Content
24Melon Honeydew (130g)90 %
25Melon Canteloupe (130g)93 %
\n", + "
" + ], + "text/plain": [ + " Fruit Water Content\n", + "24 Melon Honeydew (130g) 90 %\n", + "25 Melon Canteloupe (130g) 93 %" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('Melon')\n", + "df2 = df[df.apply(lambda row: row.astype(str).str.contains('Melon').any(), axis=1)][['Fruit', 'Water Content']]\n", + "df2" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cals\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FruitCalories per pieceCarbs (grams)Water Content
9Clementine24 cals566 %
12One average date 5g5 cals1.214 %
16Grapes 100g Seedless50 cals1582 %
25Melon Canteloupe (130g)25 cals693 %
\n", + "
" + ], + "text/plain": [ + " Fruit Calories per piece Carbs (grams) Water Content\n", + "9 Clementine 24 cals 5 66 %\n", + "12 One average date 5g 5 cals 1.2 14 %\n", + "16 Grapes 100g Seedless 50 cals 15 82 %\n", + "25 Melon Canteloupe (130g) 25 cals 6 93 %" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('cals')\n", + "df[df.apply(lambda row: row.astype(str).str.contains('cals').any(), axis=1)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataframe search with regular expression" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Apple (1 average),44 calories,10.5,85 %',\n", + " 'Apple cooking,35 calories,9,88 %',\n", + " 'Apricot,30 calories,6.7,85 %',\n", + " 'Avocado,150 calories,2,60 %',\n", + " 'Banana,107 calories,26,75 %',\n", + " 'Blackberries each,1 calorie,0.2,85 %',\n", + " 'Blackcurrant each,1.1 calorie,0.25,77 %',\n", + " 'Blueberries (new) 100g,49 Cals ( 100g ),15 g,81 %',\n", + " 'Cherry each,2.4 calories,0.6,83 %',\n", + " 'Clementine,24 cals,5,66 %',\n", + " 'Currants,5 calories,1.4,16 %',\n", + " 'Damson,28 calories,7.2,70 %',\n", + " 'One average date 5g,5 cals,1.2,14 %',\n", + " 'Dates with inverted sugar 100g,250 calories,63,12 %',\n", + " 'Figs,10 calories,2.4,24 %',\n", + " 'Gooseberries,2.6 calories,0.65,80 %',\n", + " 'Grapes 100g Seedless,50 cals,15,82 %',\n", + " 'one average Grape 6g,3 calories,0.9,82 %',\n", + " 'Grapefruit whole,100 calories,23,65 %',\n", + " 'Guava,24 calories,4.4,85 %',\n", + " 'Kiwi,34 calories,8,75 %',\n", + " 'Lemon,20 calories,3.4,85 %',\n", + " 'Lychees,3 calories,0.7,80 %',\n", + " 'Mango,40 calories,9.5,80 %',\n", + " 'Melon Honeydew (130g),36 calories,9,90 %',\n", + " 'Melon Canteloupe (130g),25 cals,6,93 %',\n", + " 'Nectarines,42 calories,9,80 %',\n", + " 'Olives,6.8 calories,trace,63 %']" + ] + }, + "execution_count": 11, 
+ "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vals = df.to_csv(header=None, index=False).strip('\\n').split('\\n')\n", + "vals" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['10.5']\n", + "['9,88']\n", + "['6.7']\n", + "['150', '2,60']\n", + "['107', '26,75']\n", + "['0.2']\n", + "['1.1', '0.25']\n", + "['100', '100']\n", + "['2.4', '0.6']\n", + "['5,66']\n", + "['1.4']\n", + "['7.2']\n", + "['1.2']\n", + "['100', '250', '63,12']\n", + "['2.4']\n", + "['2.6', '0.65']\n", + "['100', '15,82']\n", + "['0.9']\n", + "['100', '23,65']\n", + "['4.4']\n", + "['8,75']\n", + "['3.4']\n", + "['0.7']\n", + "['9.5']\n", + "['130', '9,90']\n", + "['130', '6,93']\n", + "['9,80']\n", + "['6.8']\n" + ] + } + ], + "source": [ + "import re\n", + "for val in vals:\n", + " #print(val)\n", + " found = re.findall(\"\\d+.\\d+\",val)\n", + " if found:\n", + " print(found)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27
Fruit | Apple (1 average) | Apple cooking | Apricot | Avocado | Banana | Blackberries each | Blackcurrant each | Blueberries (new) 100g | Cherry each | Clementine | ... | Grapefruit whole | Guava | Kiwi | Lemon | Lychees | Mango | Melon Honeydew (130g) | Melon Canteloupe (130g) | Nectarines | Olives
Calories per piece | 44 calories | 35 calories | 30 calories | 150 calories | 107 calories | 1 calorie | 1.1 calorie | 49 Cals ( 100g ) | 2.4 calories | 24 cals | ... | 100 calories | 24 calories | 34 calories | 20 calories | 3 calories | 40 calories | 36 calories | 25 cals | 42 calories | 6.8 calories
Carbs (grams) | 10.5 | 9 | 6.7 | 2 | 26 | 0.2 | 0.25 | 15 g | 0.6 | 5 | ... | 23 | 4.4 | 8 | 3.4 | 0.7 | 9.5 | 9 | 6 | 9 | trace
Water Content | 85 % | 88 % | 85 % | 60 % | 75 % | 85 % | 77 % | 81 % | 83 % | 66 % | ... | 65 % | 85 % | 75 % | 85 % | 80 % | 80 % | 90 % | 93 % | 80 % | 63 %
\n", + "

4 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " 0 1 2 \\\n", + "Fruit Apple (1 average) Apple cooking Apricot \n", + "Calories per piece 44 calories 35 calories 30 calories \n", + "Carbs (grams) 10.5 9 6.7 \n", + "Water Content 85 % 88 % 85 % \n", + "\n", + " 3 4 5 \\\n", + "Fruit Avocado Banana Blackberries each \n", + "Calories per piece 150 calories 107 calories 1 calorie \n", + "Carbs (grams) 2 26 0.2 \n", + "Water Content 60 % 75 % 85 % \n", + "\n", + " 6 7 8 \\\n", + "Fruit Blackcurrant each Blueberries (new) 100g Cherry each \n", + "Calories per piece 1.1 calorie 49 Cals ( 100g ) 2.4 calories \n", + "Carbs (grams) 0.25 15 g 0.6 \n", + "Water Content 77 % 81 % 83 % \n", + "\n", + " 9 ... 18 19 \\\n", + "Fruit Clementine ... Grapefruit whole Guava \n", + "Calories per piece 24 cals ... 100 calories 24 calories \n", + "Carbs (grams) 5 ... 23 4.4 \n", + "Water Content 66 % ... 65 % 85 % \n", + "\n", + " 20 21 22 23 \\\n", + "Fruit Kiwi Lemon Lychees Mango \n", + "Calories per piece 34 calories 20 calories 3 calories 40 calories \n", + "Carbs (grams) 8 3.4 0.7 9.5 \n", + "Water Content 75 % 85 % 80 % 80 % \n", + "\n", + " 24 25 \\\n", + "Fruit Melon Honeydew (130g) Melon Canteloupe (130g) \n", + "Calories per piece 36 calories 25 cals \n", + "Carbs (grams) 9 6 \n", + "Water Content 90 % 93 % \n", + "\n", + " 26 27 \n", + "Fruit Nectarines Olives \n", + "Calories per piece 42 calories 6.8 calories \n", + "Carbs (grams) 9 trace \n", + "Water Content 80 % 63 % \n", + "\n", + "[4 rows x 28 columns]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.T" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", 
+ "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Python Extract Table from PDF.ipynb b/notebooks/Python Extract Table from PDF.ipynb new file mode 100644 index 0000000..fbdb305 --- /dev/null +++ b/notebooks/Python Extract Table from PDF.ipynb @@ -0,0 +1,662 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Extract Table from PDF" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Example PDFs\n", + "\n", + "* McKinsey Global Institute Disruptive technologies\n", + "\n", + "https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Disruptive%20technologies/MGI_Disruptive_technologies_Full_report_May2013.ashx\n", + "\n", + "* Food Calories List\n", + "\n", + "http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## With tabula-py\n", + "\n", + "#### Installation\n", + "\n", + "https://pypi.org/project/tabula-py/\n", + "\n", + "`pip install tabula-py`\n", + "\n", + "#### tabula-py docs\n", + "\n", + "https://www.pydoc.io/pypi/tabula-py-0.9.0/autoapi/wrapper/index.html" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "from tabula import read_pdf\n", + "from tabulate import tabulate" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"./tmp/pdf/Food Calories List.pdf\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in \u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\")\n", + "df" + ] + }, + { + "cell_type": "code", + 
"execution_count": 3, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"./tmp/pdf/Food Calories List.pdf\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdropna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'columns'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in \u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\")\n", + "df = df.dropna(axis='columns')\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"./tmp/pdf/Food Calories List.pdf\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpages\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mprint\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mtabulate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in 
\u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\", pages=3)\n", + "print (tabulate(df))" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m 
\u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"./tmp/pdf/Food Calories List.pdf\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpages\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput_format\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"json\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in \u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + 
} + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\", pages=3, output_format=\"json\")\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"./tmp/pdf/Food Calories List.pdf\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpages\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'all'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmultiple_tables\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in \u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\", pages='all', multiple_tables=True)\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 | Fish cake | 90 cals per cake | 200 cals | Medium
0 | Fish fingers | 50 cals per piece | 220 cals | Medium
1 | Gammon | 320 cals | 280 cals | Med-High
2 | Haddock fresh | 200 cals | 110 cals | Low calorie
3 | Halibut fresh | 220 cals | 125 cals | Low calorie
4 | Ham | 6 cals | 240 cals | Medium
5 | Herring fresh grilled | 300 cals | 200 cals | Medium
6 | Kidney | 200 cals | 160 cals | Medium
7 | Kipper | 200 cals | 120 cals | Low calorie
8 | Liver | 200 cals | 150 cals | Medium
9 | Liver pate | 150 cals | 300 cals | Medium
10 | Lamb (roast) | 300 cals | 300 cals | Med-High
11 | Lobster boiled | 200 cals | 100 cals | Low calorie
12 | Luncheon meat | 300 cals | 400 cals | High
13 | Mackeral | 320 cals | 300 cals | Medium
14 | Mussels | 90 cals | 90 cals | Low-Med
15 | Pheasant roast | 200 cals | 200 cals | Medium
16 | Pilchards (tinned) | 140 cals | 140 cals | Medium
17 | Prawns | 180 cals | 100 cals | Low- Med
18 | Pork | 320 cals | 290 cals | Med-High
19 | Pork pie | 320 cals | 450 cals | High
20 | Rabbit | 200 cals | 180 cals | Medium
21 | Salmon fresh | 220 cals | 180 cals | Medium
22 | Sardines tinned in oil | 220 cals | 220 cals | Medium
23 | Sardines in tomato sauce | 180 cals | 180 cals | Medium
24 | Sausage pork fried | 250 cals | 320 cals | High
25 | Sausage pork grilled | 220 cals | 280 cals | Med-High
26 | Sausage roll | 290 cals | 480 cals | High
27 | Scampi fried in oil | 400 cals | 340 cals | High
28 | Steak & kidney pie | 400 cals | 350 cals | High
\n", + "
" + ], + "text/plain": [ + " Fish cake 90 cals per cake 200 cals Medium\n", + "0 Fish fingers 50 cals per piece 220 cals Medium\n", + "1 Gammon 320 cals 280 cals Med-High\n", + "2 Haddock fresh 200 cals 110 cals Low calorie\n", + "3 Halibut fresh 220 cals 125 cals Low calorie\n", + "4 Ham 6 cals 240 cals Medium\n", + "5 Herring fresh grilled 300 cals 200 cals Medium\n", + "6 Kidney 200 cals 160 cals Medium\n", + "7 Kipper 200 cals 120 cals Low calorie\n", + "8 Liver 200 cals 150 cals Medium\n", + "9 Liver pate 150 cals 300 cals Medium\n", + "10 Lamb (roast) 300 cals 300 cals Med-High\n", + "11 Lobster boiled 200 cals 100 cals Low calorie\n", + "12 Luncheon meat 300 cals 400 cals High\n", + "13 Mackeral 320 cals 300 cals Medium\n", + "14 Mussels 90 cals 90 cals Low-Med\n", + "15 Pheasant roast 200 cals 200 cals Medium\n", + "16 Pilchards (tinned) 140 cals 140 cals Medium\n", + "17 Prawns 180 cals 100 cals Low- Med\n", + "18 Pork 320 cals 290 cals Med-High\n", + "19 Pork pie 320 cals 450 cals High\n", + "20 Rabbit 200 cals 180 cals Medium\n", + "21 Salmon fresh 220 cals 180 cals Medium\n", + "22 Sardines tinned in oil 220 cals 220 cals Medium\n", + "23 Sardines in tomato sauce 180 cals 180 cals Medium\n", + "24 Sausage pork fried 250 cals 320 cals High\n", + "25 Sausage pork grilled 220 cals 280 cals Med-High\n", + "26 Sausage roll 290 cals 480 cals High\n", + "27 Scampi fried in oil 400 cals 340 cals High\n", + "28 Steak & kidney pie 400 cals 350 cals High" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=3)\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + 
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\", encoding = 'ISO-8859-1',\n\u001b[0;32m----> 2\u001b[0;31m stream=True, area = [269.875, 12.75, 790.5, 961], pages = 4, guess = False, pandas_options={'header':None})\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/tabula/wrapper.py\u001b[0m in \u001b[0;36mread_pdf\u001b[0;34m(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, **kwargs)\u001b[0m\n\u001b[1;32m 103\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 104\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 105\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mFileNotFoundError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrerror\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merrno\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mENOENT\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 107\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No 
such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "df = read_pdf(\"./tmp/pdf/Food Calories List.pdf\", encoding = 'ISO-8859-1',\n", + " stream=True, area = [269.875, 12.75, 790.5, 961], pages = 4, guess = False, pandas_options={'header':None})\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = read_pdf(\"./tmp/pdf/output.pdf\", encoding = 'ISO-8859-1',\n", + " stream=True, guess = False)\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = read_pdf(\"./tmp/pdf/output.pdf\", encoding = 'ISO-8859-1',\n", + " stream=True, area=[269.875, 12.75, 790.5, 961], guess = False)\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## With Camelot\n", + "\n", + "#### Installation\n", + "\n", + "https://pypi.org/project/camelot-py/\n", + "\n", + "`pip install camelot-py`\n", + "\n", + "#### Camelot readme\n", + "\n", + "https://github.com/socialcopsdev/camelot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import camelot\n", + "tables = camelot.read_pdf(\"./tmp/pdf//Food Calories List.pdf\")\n", + "tables[0].df[1:3]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tables1 = camelot.read_pdf(\"./tmp/pdf/MGI_Disruptive_technologies_Full_report_May2013.pdf\", pages='32', area=[269.875, 120.75, 790.5, 561])\n", + "print (tabulate(tables1[0].df))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for i in range(30,35):\n", + " print (i)\n", + " tables = camelot.read_pdf(\"./tmp/pdf/MGI_Disruptive_technologies_Full_report_May2013.pdf\", pages='%d' % i)\n", + " try:\n", + " print (tabulate(tables[0].df))\n", + " print (tabulate(tables[1].df))\n", + " except 
IndexError:\n", + " print('NOK')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extract by PyPDF2\n", + "\n", + "#### Installation\n", + "\n", + "https://pypi.org/project/PyPDF2/\n", + "\n", + "`pip install PyPDF2`" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "ename": "FileNotFoundError", + "evalue": "[Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mPyPDF2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mpdf_file\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mopen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'./tmp/pdf/Food Calories List.pdf'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'rb'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mread_pdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mPyPDF2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPdfFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpdf_file\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mnumber_of_pages\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetNumPages\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mpage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mread_pdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetPage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + 
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: './tmp/pdf/Food Calories List.pdf'" + ] + } + ], + "source": [ + "import PyPDF2\n", + "pdf_file = open('./tmp/pdf/Food Calories List.pdf', 'rb')\n", + "read_pdf = PyPDF2.PdfFileReader(pdf_file)\n", + "number_of_pages = read_pdf.getNumPages()\n", + "page = read_pdf.getPage(2)\n", + "page_content = page.extractText()\n", + "print (page_content.encode('utf-8'))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'page_content' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mtable_list\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpage_content\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0ml\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray_split\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable_list\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtable_list\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mrange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mNameError\u001b[0m: name 'page_content' is not defined" + ] + } + ], + "source": [ + "import numpy\n", + "\n", + "table_list = page_content.split('\\n')\n", + "l = numpy.array_split(table_list, len(table_list)/4)\n", + "for i in range(0,5):\n", + " print(l[i])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Python group and sort a list of lists by a specific index,pattern.ipynb b/notebooks/Python group and sort a list of lists by a specific index,pattern.ipynb new file mode 100644 index 0000000..396395f --- /dev/null +++ b/notebooks/Python group and sort a list of lists by a specific index,pattern.ipynb @@ -0,0 +1,424 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "movies = [\n", + "1, \"Avatar\" ,'good',\n", + "2, \"Titanic\" ,'not bad',\n", + "3, \"Star Wars: The Force Awakens\" ,'good',\n", + "4, \"Jurassic World\" ,'good',\n", + "5, \"The Avengers\" ,'not bad',\n", + "6, \"Furious 7\" ,'not bad',\n", + "7, \"Avengers: Age of Ultron\" ,'good',\n", + 
"8, \"Harry Potter and the Deathly Hallows – Part 2\" ,'not bad',\n", + "9, \"Frozen\" ,'good',\n", + "\n", + "\n", + "\"The Birth of a Nation\" ,1915,\n", + "\"The Birth of a Nation\" ,1940,\n", + "\"Gone with the Wind\" ,1940,\n", + "\"Gone with the Wind\" ,1963,\n", + "\"Gone with the Wind\" ,1963,\n", + "\"The Sound of Music\" ,1966]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def sortGroupList(list_unsorted, category, category2, short=True):\n", + " listx = []\n", + " listy = []\n", + " last_section = 0\n", + " for i in range(0, len(list_unsorted), 3):\n", + " if list_unsorted[i + 2] == category:\n", + " listy.append(list_unsorted[i])\n", + " listy.append(list_unsorted[i + 1])\n", + " if not short:\n", + " listy.append(list_unsorted[i + 2])\n", + " last_section = i+2\n", + " elif list_unsorted[i + 2] == category2:\n", + " listx.append(list_unsorted[i])\n", + " listx.append(list_unsorted[i + 1])\n", + " if not short:\n", + " listx.append(list_unsorted[i + 2])\n", + " last_section = i + 2\n", + " header_category = [' - ' + category + ' - ']\n", + " header_category2 = [' - ' + category2 + ' - ']\n", + " header_category3 = [' - ' + ' - ']\n", + " return header_category + listy + header_category2 + listx + header_category3 + list_unsorted[last_section:]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sortGroupList(movies, 'good', 'not bad')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "movies = [\n", + "1, \"Avatar\" ,2009,\n", + "2, \"Titanic\" ,1997,\n", + "3, \"Star Wars: The Force Awakens\" ,2015,\n", + "4, \"Jurassic World\" ,2015,\n", + "5, \"The Avengers\" ,2012,\n", + "6, \"Furious 7\" ,2015,\n", + "7, \"Avengers: Age of Ultron\" ,2015,\n", + "8, \"Harry Potter and the Deathly Hallows – Part 2\" ,2011,\n", + "9, \"Frozen\" ,2013,\n", + "\n", + "\n", + 
"\"The Birth of a Nation\" ,1915,\n", + "\"The Birth of a Nation\" ,1940,\n", + "\"Gone with the Wind\" ,1940,\n", + "\"Gone with the Wind\" ,1963,\n", + "\"The Sound of Music\" ,1966]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(len(movies))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "years = [str(x) for x in range(1997, 2015)]\n", + "years" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def sortGroupList(list_unsorted):\n", + " listx = []\n", + " listy = []\n", + " for i in range(0, len(list_unsorted), 3):\n", + " if list_unsorted[i + 2] in years:\n", + " listy.append(list_unsorted[i])\n", + " listy.append(list_unsorted[i + 1])\n", + " listy.append(list_unsorted[i + 2])\n", + " else:\n", + " listx.append(list_unsorted[i])\n", + " listx.append(list_unsorted[i + 1])\n", + " listx.append(list_unsorted[i + 2])\n", + " for i in listy:\n", + " print(i)\n", + " for i in listx:\n", + " print(i)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sortGroupList(movies)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "movies = [\n", + "1, \"Avatar\" ,'good',\n", + "2, \"Titanic\" ,'not bad',\n", + "3, \"Star Wars: The Force Awakens\" ,'good',\n", + "4, \"Jurassic World\" ,'good',\n", + "5, \"The Avengers\" ,'not bad',\n", + "6, \"Furious 7\" ,'not bad',\n", + "7, \"Avengers: Age of Ultron\" ,'good',\n", + "8, \"Harry Potter and the Deathly Hallows – Part 2\" ,'not bad',\n", + "9, \"Frozen\" ,'good',\n", + "\n", + "\n", + "\"The Birth of a Nation\" ,1915,\n", + "\"The Birth of a Nation\" ,1940,\n", + "\"Gone with the Wind\" ,1940,\n", + "\"Gone with the Wind\" ,1963,\n", + "\"The Sound of Music\" ,1966]\n", + "df = 
pd.DataFrame(movies)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "types = []\n", + "raw_list = []\n", + "for e in movies:\n", + " types.append(type(e))\n", + " if isinstance(e, int):\n", + " raw_list.append(1)\n", + " else:\n", + " raw_list.append(0)\n", + "df1 = pd.DataFrame({'elem':movies, 'types':types}) " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "raw_list = [1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 0,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 1,\n", + " 0,\n", + " 1]\n", + "movies = [\n", + "1, \"Avatar\" ,'good',\n", + "2, \"Titanic\" ,'not bad',\n", + "3, \"Star Wars: The Force Awakens\" ,'good',\n", + "4, \"Jurassic World\" ,'good',\n", + "5, \"The Avengers\" ,'not bad',\n", + "6, \"Furious 7\" ,'not bad',\n", + "7, \"Avengers: Age of Ultron\" ,'good',\n", + "8, \"Harry Potter and the Deathly Hallows – Part 2\" ,'not bad',\n", + "9, \"Frozen\" ,'good',\n", + "\n", + "\n", + "\"The Birth of a Nation\" ,1915,\n", + "\"The Birth of a Nation\" ,1940,\n", + "\"Gone with the Wind\" ,1940,\n", + "\"Gone with the Wind\" ,1963,\n", + "\"The Sound of Music\" ,1966]" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0, 1]\n", + "[0, 1]\n", + "[0, 1]\n", + "[0, 1]\n", + "[0, 1]\n" + ] + }, + { + "data": { + "text/plain": [ + "[[1, 'Avatar', 'good'],\n", + " [2, 'Titanic', 'not bad'],\n", + " [3, 'Star 
Wars: The Force Awakens', 'good'],\n", + " [4, 'Jurassic World', 'good'],\n", + " [5, 'The Avengers', 'not bad'],\n", + " [6, 'Furious 7', 'not bad'],\n", + " [7, 'Avengers: Age of Ultron', 'good'],\n", + " [8, 'Harry Potter and the Deathly Hallows – Part 2', 'not bad'],\n", + " [9, 'Frozen', 'good']]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "patern1 = [1, 0, 0]\n", + "patern2 = [1, 0]\n", + "\n", + "len1 = len(patern1)\n", + "len2 = len(patern2)\n", + "\n", + "output1 = []\n", + "output2 = []\n", + "\n", + "while(raw_list):\n", + " if raw_list[:len1] == patern1: \n", + " output1.append(movies[:len1])\n", + " raw_list = raw_list[len1:]\n", + " movies = movies[len1:]\n", + " else:\n", + " print(raw_list[:len2])\n", + " output2.append(movies[:len2])\n", + " raw_list = raw_list[len2:]\n", + " movies = movies[len2:]\n", + " \n", + "output1" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['The Birth of a Nation', 1915],\n", + " ['The Birth of a Nation', 1940],\n", + " ['Gone with the Wind', 1940],\n", + " ['Gone with the Wind', 1963],\n", + " ['The Sound of Music', 1966]]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output2" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "new_list = sorted(output1, key=lambda x: x[2])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[1, 'Avatar', 'good'],\n", + " [3, 'Star Wars: The Force Awakens', 'good'],\n", + " [4, 'Jurassic World', 'good'],\n", + " [7, 'Avengers: Age of Ultron', 'good'],\n", + " [9, 'Frozen', 'good'],\n", + " [2, 'Titanic', 'not bad'],\n", + " [5, 'The Avengers', 'not bad'],\n", + " [6, 'Furious 7', 'not bad'],\n", + " [8, 'Harry Potter and the 
Deathly Hallows – Part 2', 'not bad']]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Python make groups in a list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Simple grouping" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/notebooks/Python_group_or_sort_list_of_lists_by_common_element.ipynb b/notebooks/Python_group_or_sort_list_of_lists_by_common_element.ipynb new file mode 100644 index 0000000..e6ed72a --- /dev/null +++ b/notebooks/Python_group_or_sort_list_of_lists_by_common_element.ipynb @@ -0,0 +1,705 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Python group or sort list of lists by common element\n", + "\n", + " * Grouping of lists of list by position\n", + " * Grouping of lists of list by key\n", + " * Sort and group flatten lists of lists\n", + " * Grouping list of lists different sizes\n", + " \n", + " #### Bonus tips\n", + " \n", + " \n", + " * Sort list of lists elements\n", + " * sort maps by key or value\n", + " * Iterating list over every two elements\n", + " * Iterating list over every N elements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# equaly sized list of lists \n", + "[[\"Linux\", 0], [\"Windows 7\",1], [\"Ubuntu\",0], [\"Windows 10\",1], 
[\"MacOS\",2], [\"Linux Mint\",0]]\n", + "\n", + "# Different sized list of lists \n", + "[[\"Linux\", 0, 22], [\"Windows 7\",1 , 5, 6], [\"Ubuntu\",0], [\"Linux Mint\"]]\n", + "\n", + "# flatten\n", + "[\"Linux\", 0, \"Windows 7\",1, \"Ubuntu\",0, \"Windows 10\",1, \"MacOS\",2, \"Linux Mint\",0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Grouping of lists of list by position (size 2)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['Linux', 'Ubuntu', 'Linux Mint'], ['Windows 7', 'Windows 10'], ['MacOS']]" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# equaly sized list of lists \n", + "raw_list = [[\"Linux\", 0], [\"Windows 7\",1], [\"Ubuntu\",0], [\"Windows 10\",1], [\"MacOS\",2], [\"Linux Mint\",0]]\n", + "\n", + "keys = set(map(lambda x:x[1], raw_list))\n", + "new_list = [[y[0] for y in raw_list if y[1]==x] for x in keys]\n", + "\n", + "new_list" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: ['Linux', 'Ubuntu', 'Linux Mint'],\n", + " 1: ['Windows 7', 'Windows 10'],\n", + " 2: ['MacOS']}" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\n", + "raw_list = [[\"Linux\", 0], [\"Windows 7\",1], [\"Ubuntu\",0], [\"Windows 10\",1], [\"MacOS\",2], [\"Linux Mint\",0]]\n", + "\n", + "keys = set(map(lambda x:x[1], raw_list))\n", + "new_list = {x:[y[0] for y in raw_list if y[1]==x] for x in keys}\n", + "\n", + "new_list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Grouping of lists of list by position (size 4)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Ubuntu': [['Xenial Xerus', 0.4], ['Bionic Beaver', 0]],\n", + " 
'Linux Mint': [['Rosa', 17.3], ['Sonya', 18.2]]}" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "raw_list = [\n", + " ['Linux Mint', 17, 'Rosa', 17.3], \n", + " ['Linux Mint', 18, 'Sonya', 18.2],\n", + " ['Ubuntu', 16, 'Xenial Xerus', 0.4],\n", + " ['Ubuntu', 18, 'Bionic Beaver', 0]]\n", + "\n", + "keys = set(map(lambda x:x[0], raw_list))\n", + "unsorted_map = {x:[y[2:] for y in raw_list if y[0]==x] for x in keys}\n", + "\n", + "unsorted_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### List of list different size" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "raw_list = [\n", + " ['Linux Mint', 17, 'Rosa', 17.3], \n", + " ['Linux Mint', 18, 'Sonya', 18.2],\n", + " ['Ubuntu', 16, 'Xenial Xerus', 0.4],\n", + " ['Ubuntu', 18, 'Bionic Beaver', 0],\n", + " \n", + " ['Windows', 7, 'Home'],\n", + " ['Windows', 7, 'Profesional'],\n", + " ['Windows', 10, 'Ultimate']\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Ubuntu': [[16, 'Xenial Xerus'], [18, 'Bionic Beaver']],\n", + " 'Linux Mint': [[17, 'Rosa'], [18, 'Sonya']],\n", + " 'Windows': [[7, 'Home'], [7, 'Profesional'], [10, 'Ultimate']]}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "keys = set(map(lambda x:x[0], raw_list))\n", + "unsorted_map = {x:[y[1:3] for y in raw_list if y[0]==x] for x in keys}\n", + "unsorted_map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Sort python map by key or value" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Linux Mint', 'Ubuntu', 'Windows']" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": 
[ + "sorted(unsorted_map.keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[[7, 'Home'], [7, 'Profesional'], [10, 'Ultimate']],\n", + " [[16, 'Xenial Xerus'], [18, 'Bionic Beaver']],\n", + " [[17, 'Rosa'], [18, 'Sonya']]]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sorted(unsorted_map.values())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sort list of lists by key" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Linux Mint: [[17, 'Rosa'], [18, 'Sonya']]\n", + "Ubuntu: [[16, 'Xenial Xerus'], [18, 'Bionic Beaver']]\n", + "Windows: [[7, 'Home'], [7, 'Profesional'], [10, 'Ultimate']]\n" + ] + } + ], + "source": [ + "for key in sorted(unsorted_map.keys()):\n", + " print (\"%s: %s\" % (key, unsorted_map[key]))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Windows: [[7, 'Home'], [7, 'Profesional'], [10, 'Ultimate']]\n", + "Ubuntu: [[16, 'Xenial Xerus'], [18, 'Bionic Beaver']]\n", + "Linux Mint: [[17, 'Rosa'], [18, 'Sonya']]\n" + ] + } + ], + "source": [ + "for key in sorted(unsorted_map.keys(), reverse=True):\n", + " print (\"%s: %s\" % (key, unsorted_map[key]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sort and group flatten lists of lists" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "os_list = [\n", + " 'Ubuntu 18',\n", + " 'This article informs you about Ubuntu 18.04 release date,',\n", + " 'Released',\n", + " \n", + " 'Ubuntu 20',\n", + " 'The desktop image allows you to try Ubuntu without changing y..',\n", + " 'Not Released',\n", + " \n", + " 
'Ubuntu 19',\n", + " 'Ubuntu is an open source software operating system that runs from',\n", + " 'Released',\n", + " \n", + " 'Linux mint 18',\n", + " 'Linux Mint is an elegant, easy to use, up to date and comfortable',\n", + " 'Released',\n", + " \n", + " 'Linux mint 20',\n", + " 'Suggestion: For Mint 20 to go full Debian',\n", + " 'Not Released',\n", + " \n", + " 'Linux mint 19',\n", + " 'Linux Mint 19 is a long term support release which will be supported until 2023',\n", + " 'Released',\n", + "\n", + " 'Windows 7',\n", + " 'Windows 7 is a personal computer operating system that was ..',\n", + " 'Windows 10',\n", + " 'Windows 10 is a series of personal computer operating systems',\n", + " \"Windows XP\",\n", + " 'Windows XP is old, and Microsoft no longer provides official support']" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[1, 2]\n", + "[3, 4]\n", + "[5, 6]\n" + ] + } + ], + "source": [ + "# iterating over every two elements\n", + "test_list = [1, 2, 3, 4, 5, 6]\n", + "\n", + "for i in range(0, len(test_list), 2):\n", + " print (test_list[i:i+2])" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0, 1, 2]\n", + "[3, 4, 5]\n", + "[6, 7, 8]\n", + "[9]\n" + ] + } + ], + "source": [ + "# iterating over every N elements\n", + "test_list = list(range(0, 10))\n", + "\n", + "for i in range(0, len(test_list), 3):\n", + " print (test_list[i:i+3])" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Windows 7',\n", + " 'Windows 7 is a personal computer operating system that was ..',\n", + " 'Windows 10',\n", + " 'Windows 10 is a series of personal computer operating systems',\n", + " 'Windows XP',\n", + " 'Windows XP is old, and Microsoft no longer provides official 
support']" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list3 = []\n", + "last = 0\n", + "for i in range(0, len(os_list), 3):\n", + " if i+2 < len(os_list) and os_list[i+2] in ['Released', 'Not Released']:\n", + " list3.append(os_list[i:i+3])\n", + " last = i+3\n", + "list2 = os_list[last:]\n", + "list2" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['Ubuntu 18',\n", + " 'This article informs you about Ubuntu 18.04 release date,',\n", + " 'Released'],\n", + " ['Ubuntu 20',\n", + " 'The desktop image allows you to try Ubuntu without changing y..',\n", + " 'Not Released'],\n", + " ['Ubuntu 19',\n", + " 'Ubuntu is an open source software operating system that runs from',\n", + " 'Released'],\n", + " ['Linux mint 18',\n", + " 'Linux Mint is an elegant, easy to use, up to date and comfortable',\n", + " 'Released'],\n", + " ['Linux mint 20',\n", + " 'Suggestion: For Mint 20 to go full Debian',\n", + " 'Not Released'],\n", + " ['Linux mint 19',\n", + " 'Linux Mint 19 is a long term support release which will be supported until 2023',\n", + " 'Released']]" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list3" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "def sortList(working_list, category, category2):\n", + " listx = []\n", + " listy = []\n", + " last_section = 0\n", + " for i in range(0, len(working_list) - 3, 3):\n", + " if working_list[i + 2] == category:\n", + " listy.append(working_list[i])\n", + " listy.append(working_list[i + 1])\n", + " last_section = i + 2\n", + " elif working_list[i + 2] == category2:\n", + " listx.append(working_list[i])\n", + " listx.append(working_list[i + 1])\n", + " last_section = i + 2\n", + "\n", + " if last_section > 0:\n", + " listz = working_list[(last_section 
+ 1):]\n", + " else:\n", + " listz = working_list[(last_section):]\n", + "\n", + " return listx, listy, listz" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Ubuntu 20',\n", + " 'The desktop image allows you to try Ubuntu without changing y..',\n", + " 'Linux mint 20',\n", + " 'Suggestion: For Mint 20 to go full Debian']" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "listx, listy, listz = sortList(os_list, 'Released', 'Not Released')\n", + "listx" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Ubuntu 18',\n", + " 'This article informs you about Ubuntu 18.04 release date,',\n", + " 'Ubuntu 19',\n", + " 'Ubuntu is an open source software operating system that runs from',\n", + " 'Linux mint 18',\n", + " 'Linux Mint is an elegant, easy to use, up to date and comfortable',\n", + " 'Linux mint 19',\n", + " 'Linux Mint 19 is a long term support release which will be supported until 2023']" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "listy" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Windows 7',\n", + " 'Windows 7 is a personal computer operating system that was ..',\n", + " 'Windows 10',\n", + " 'Windows 10 is a series of personal computer operating systems',\n", + " 'Windows XP',\n", + " 'Windows XP is old, and Microsoft no longer provides official support']" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "listz" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generic solution for flatten list" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { 
+ "data": { + "text/plain": [ + "[['Ubuntu 18',\n", + " 'This article informs you about Ubuntu 18.04 release date,',\n", + " 'Released'],\n", + " ['Ubuntu 20',\n", + " 'The desktop image allows you to try Ubuntu without changing y..',\n", + " 'Not Released'],\n", + " ['Ubuntu 19',\n", + " 'Ubuntu is an open source software operating system that runs from',\n", + " 'Released'],\n", + " ['Linux mint 18',\n", + " 'Linux Mint is an elegant, easy to use, up to date and comfortable',\n", + " 'Released'],\n", + " ['Linux mint 20',\n", + " 'Suggestion: For Mint 20 to go full Debian',\n", + " 'Not Released'],\n", + " ['Linux mint 19',\n", + " 'Linux Mint 19 is a long term support release which will be supported until 2023',\n", + " 'Released']]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "os_list = [\n", + " \n", + " \n", + " 'Windows 10',\n", + " 'Windows 10 is a series of personal computer operating systems',\n", + " \"Windows XP\",\n", + " 'Windows XP is old, and Microsoft no longer provides official support',\n", + " \n", + " 'Ubuntu 18',\n", + " 'This article informs you about Ubuntu 18.04 release date,',\n", + " 'Released',\n", + "\n", + " 'Ubuntu 20',\n", + " 'The desktop image allows you to try Ubuntu without changing y..',\n", + " 'Not Released',\n", + "\n", + " 'Windows 7',\n", + " 'Windows 7 is a personal computer operating system that was ..',\n", + "\n", + " 'Ubuntu 19',\n", + " 'Ubuntu is an open source software operating system that runs from',\n", + " 'Released',\n", + "\n", + " 'Linux mint 18',\n", + " 'Linux Mint is an elegant, easy to use, up to date and comfortable',\n", + " 'Released',\n", + "\n", + " 'Linux mint 20',\n", + " 'Suggestion: For Mint 20 to go full Debian',\n", + " 'Not Released',\n", + "\n", + " 'Linux mint 19',\n", + " 'Linux Mint 19 is a long term support release which will be supported until 2023',\n", + " 'Released',\n", + "\n", + "]\n", + "\n", + "list3 = []\n", + 
"list2 = []\n", + "cur = 0\n", + "\n", + "os_list_tmp = os_list\n", + "\n", + "while cur <= len(os_list_tmp):\n", + " cur = 0\n", + " if cur+2 < len(os_list_tmp) and os_list_tmp[cur+2] in ['Released', 'Not Released']:\n", + " list3.append(os_list_tmp[cur:cur+3])\n", + " cur = cur + 3\n", + " else:\n", + " list2.append(os_list_tmp[cur:cur+2])\n", + " cur = cur + 2\n", + " os_list_tmp = os_list_tmp[cur:]\n", + "list3" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['Windows 10',\n", + " 'Windows 10 is a series of personal computer operating systems'],\n", + " ['Windows XP',\n", + " 'Windows XP is old, and Microsoft no longer provides official support'],\n", + " ['Windows 7',\n", + " 'Windows 7 is a personal computer operating system that was ..']]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/Q&A/Questions_and_Answers_1_Improve_OCR_and_tabula_range.ipynb b/notebooks/Q&A/Questions_and_Answers_1_Improve_OCR_and_tabula_range.ipynb new file mode 100644 index 0000000..27d4422 --- /dev/null +++ b/notebooks/Q&A/Questions_and_Answers_1_Improve_OCR_and_tabula_range.ipynb @@ -0,0 +1,513 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": 
[ + "## Questions and Answers 2 Improve OCR and tabula range" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Question 1\n", + "\n", + "#### Extract tabular data from PDF with Python - Tabula, Camelot, PyPDF2\n", + "\n", + "https://youtu.be/702lkQbZx50\n", + "\n", + "![Question 1](../images/Selection_177.png)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(29, 4)" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from tabula import read_pdf\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=3)\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(69, 5)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# specify page range 1 to 3 page\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages='1-3')\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
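<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>BREADS &amp; CEREALS</th>\n", + " <th>Portion size *</th>\n", + " <th>per 100 grams (3.5 oz)</th>\n", + " <th>Unnamed: 3</th>\n", + " <th>energy content</th>\n", + " </tr>\n",
+ " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>Bagel ( 1 average )</td>\n", + " <td>140 cals (45g)</td>\n", + " <td>310 cals</td>\n", + " <td>NaN</td>\n", + " <td>Medium</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>Biscuit digestives</td>\n", + " <td>86 cals (per biscuit)</td>\n", + " <td>480 cals</td>\n", + " <td>NaN</td>\n", + " <td>High</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>Jaffa cake</td>\n", + " <td>48 cals (per biscuit)</td>\n", + " <td>370 cals</td>\n", + " <td>NaN</td>\n", + " <td>Med-High</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>Bread white (thick slice)</td>\n", + " <td>96 cals (1 slice 40g)</td>\n", + " <td>240 cals</td>\n", + " <td>NaN</td>\n", + " <td>Medium</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>Bread wholemeal (thick)</td>\n", + " <td>88 cals (1 slice 40g)</td>\n", + " <td>220 cals</td>\n", + " <td>NaN</td>\n", + " <td>Low-med</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>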
" + ], + "text/plain": [ + " BREADS & CEREALS Portion size * per 100 grams (3.5 oz) \\\n", + "0 Bagel ( 1 average ) 140 cals (45g) 310 cals \n", + "1 Biscuit digestives 86 cals (per biscuit) 480 cals \n", + "2 Jaffa cake 48 cals (per biscuit) 370 cals \n", + "3 Bread white (thick slice) 96 cals (1 slice 40g) 240 cals \n", + "4 Bread wholemeal (thick) 88 cals (1 slice 40g) 220 cals \n", + "\n", + " Unnamed: 3 energy content \n", + "0 NaN Medium \n", + "1 NaN High \n", + "2 NaN Med-High \n", + "3 NaN Medium \n", + "4 NaN Low-med " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
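<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>BREADS &amp; CEREALS</th>\n", + " <th>Portion size *</th>\n", + " <th>per 100 grams (3.5 oz)</th>\n", + " <th>Unnamed: 3</th>\n", + " <th>energy content</th>\n", + " </tr>\n",
+ " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>64</th>\n", + " <td>Sausage pork fried</td>\n", + " <td>250 cals</td>\n", + " <td>320 cals</td>\n", + " <td>High</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>65</th>\n", + " <td>Sausage pork grilled</td>\n", + " <td>220 cals</td>\n", + " <td>280 cals</td>\n", + " <td>Med-High</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>66</th>\n", + " <td>Sausage roll</td>\n", + " <td>290 cals</td>\n", + " <td>480 cals</td>\n", + " <td>High</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>67</th>\n", + " <td>Scampi fried in oil</td>\n", + " <td>400 cals</td>\n", + " <td>340 cals</td>\n", + " <td>High</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>68</th>\n", + " <td>Steak &amp; kidney pie</td>\n", + " <td>400 cals</td>\n", + " <td>350 cals</td>\n", + " <td>High</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>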
" + ], + "text/plain": [ + " BREADS & CEREALS Portion size * per 100 grams (3.5 oz) Unnamed: 3 \\\n", + "64 Sausage pork fried 250 cals 320 cals High \n", + "65 Sausage pork grilled 220 cals 280 cals Med-High \n", + "66 Sausage roll 290 cals 480 cals High \n", + "67 Scampi fried in oil 400 cals 340 cals High \n", + "68 Steak & kidney pie 400 cals 350 cals High \n", + "\n", + " energy content \n", + "64 NaN \n", + "65 NaN \n", + "66 NaN \n", + "67 NaN \n", + "68 NaN " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(69, 5)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create page range 1 to 3 page\n", + "pages=(str(1)+'-'+str(3))\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=pages)\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(69, 5)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# list all possible pages\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=[1,2,3])\n", + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(69, 5)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# list all possible pages using range\n", + "pages = list(range(1, 4))\n", + "df = read_pdf(\"http://www.uncledavesenterprise.com/file/health/Food%20Calories%20List.pdf\", pages=pages)\n", + "df.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Question 2\n", + 
"\n", + "#### python extract text from image or pdf\n", + "\n", + "https://youtu.be/PK-GvWWQ03g\n", + "\n", + "![Question ](../images/Selection_178.png)\n", + "\n", + "python extract text from image or pdf\n", + "\n", + "https://blog.softhints.com/python-extract-text-from-image-or-pdf/\n", + "\n", + "Improve OCR Accuracy With Advanced Image Preprocessing\n", + "\n", + "https://docparser.com/blog/improve-ocr-accuracy/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![Question ](../images/Selection_174.png)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "from PIL import Image\n", + "import pytesseract" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Java\n", + "\n", + "Python\n", + "\n", + "public class JavaPyramid1 {\n", + "public static void main(String[] args) {\n", + "for(int i=1; i<=5; i++) {\n", + "for(int j=0; j\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + 
" \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01234
0RankCountryArea (km²)NotesNaN
11Russia*1300000017,125,191 km² including European Russia[1]NaN
22China9596961excludes Hong Kong, Macau, Taiwan and disputed...NaN
33India[2]3287263NaNNaN
44Kazakhstan*25449002,724,900 km² including European partNaN
55Saudi Arabia2149690NaNNaN
66Iran1648195NaNNaN
77Mongolia1564110NaNNaN
88Indonesia*14726391,904,569 km² including Oceanian partNaN
99Pakistan881913NaNNaN
1010Turkey*759592783,356 km² including East ThraceNaN
1111Myanmar676578NaNNaN
1212Afghanistan652230NaNNaN
1313Yemen527968NaNNaN
1414Thailand513120NaNNaN
1515Turkmenistan488100NaNNaN
1616Uzbekistan447400NaNNaN
1717Iraq438317NaNNaN
1818Japan377930NaNNaN
1919Vietnam331212NaNNaN
2020Malaysia330803NaNNaN
2121Oman309500NaNNaN
2222Philippines300000NaNNaN
2323Laos236800NaNNaN
2424Kyrgyzstan199951NaNNaN
2525Syria185180Includes the parts of the Golan HeightsNaN
2626Cambodia181035NaNNaN
2727Bangladesh147570NaNNaN
2828Nepal147181NaNNaN
2929Tajikistan143100NaNNaN
3030North Korea120538NaNNaN
3131South Korea100210NaNNaN
3232Jordan89342NaNNaN
3333Azerbaijan*86000Located in the Caucasus, between Europe and AsiaNaN
3434United Arab Emirates83600NaNNaN
3535Georgia*69000Located in the Caucasus, between Europe and AsiaNaN
3636Sri Lanka65610NaNNaN
3737Bhutan38394NaNNaN
3838Taiwan36193partially recognized state/not a UN memberNaN
3939Armenia29843Located in the Caucasus, between Europe and AsiaNaN
4040Israel22072NaNNaN
4141Kuwait17818NaNNaN
4242Timor-Leste14874NaNNaN
4343Qatar11586NaNNaN
4444Lebanon10452NaNNaN
4545Cyprus9251NaNNaN
4646Palestine6220partially recognized state/non-member observer...NaN
4747Brunei5765NaNNaN
4848Bahrain760NaNNaN
4949Singapore697NaNNaN
5050Maldives300NaNNaN
51NaNTotal44528251NaNNaN
\n", + "" + ], + "text/plain": [ + " 0 1 2 \\\n", + "0 Rank Country Area (km²) \n", + "1 1 Russia* 13000000 \n", + "2 2 China 9596961 \n", + "3 3 India[2] 3287263 \n", + "4 4 Kazakhstan* 2544900 \n", + "5 5 Saudi Arabia 2149690 \n", + "6 6 Iran 1648195 \n", + "7 7 Mongolia 1564110 \n", + "8 8 Indonesia* 1472639 \n", + "9 9 Pakistan 881913 \n", + "10 10 Turkey* 759592 \n", + "11 11 Myanmar 676578 \n", + "12 12 Afghanistan 652230 \n", + "13 13 Yemen 527968 \n", + "14 14 Thailand 513120 \n", + "15 15 Turkmenistan 488100 \n", + "16 16 Uzbekistan 447400 \n", + "17 17 Iraq 438317 \n", + "18 18 Japan 377930 \n", + "19 19 Vietnam 331212 \n", + "20 20 Malaysia 330803 \n", + "21 21 Oman 309500 \n", + "22 22 Philippines 300000 \n", + "23 23 Laos 236800 \n", + "24 24 Kyrgyzstan 199951 \n", + "25 25 Syria 185180 \n", + "26 26 Cambodia 181035 \n", + "27 27 Bangladesh 147570 \n", + "28 28 Nepal 147181 \n", + "29 29 Tajikistan 143100 \n", + "30 30 North Korea 120538 \n", + "31 31 South Korea 100210 \n", + "32 32 Jordan 89342 \n", + "33 33 Azerbaijan* 86000 \n", + "34 34 United Arab Emirates 83600 \n", + "35 35 Georgia* 69000 \n", + "36 36 Sri Lanka 65610 \n", + "37 37 Bhutan 38394 \n", + "38 38 Taiwan 36193 \n", + "39 39 Armenia 29843 \n", + "40 40 Israel 22072 \n", + "41 41 Kuwait 17818 \n", + "42 42 Timor-Leste 14874 \n", + "43 43 Qatar 11586 \n", + "44 44 Lebanon 10452 \n", + "45 45 Cyprus 9251 \n", + "46 46 Palestine 6220 \n", + "47 47 Brunei 5765 \n", + "48 48 Bahrain 760 \n", + "49 49 Singapore 697 \n", + "50 50 Maldives 300 \n", + "51 NaN Total 44528251 \n", + "\n", + " 3 4 \n", + "0 Notes NaN \n", + "1 17,125,191 km² including European Russia[1] NaN \n", + "2 excludes Hong Kong, Macau, Taiwan and disputed... 
NaN \n", + "3 NaN NaN \n", + "4 2,724,900 km² including European part NaN \n", + "5 NaN NaN \n", + "6 NaN NaN \n", + "7 NaN NaN \n", + "8 1,904,569 km² including Oceanian part NaN \n", + "9 NaN NaN \n", + "10 783,356 km² including East Thrace NaN \n", + "11 NaN NaN \n", + "12 NaN NaN \n", + "13 NaN NaN \n", + "14 NaN NaN \n", + "15 NaN NaN \n", + "16 NaN NaN \n", + "17 NaN NaN \n", + "18 NaN NaN \n", + "19 NaN NaN \n", + "20 NaN NaN \n", + "21 NaN NaN \n", + "22 NaN NaN \n", + "23 NaN NaN \n", + "24 NaN NaN \n", + "25 Includes the parts of the Golan Heights NaN \n", + "26 NaN NaN \n", + "27 NaN NaN \n", + "28 NaN NaN \n", + "29 NaN NaN \n", + "30 NaN NaN \n", + "31 NaN NaN \n", + "32 NaN NaN \n", + "33 Located in the Caucasus, between Europe and Asia NaN \n", + "34 NaN NaN \n", + "35 Located in the Caucasus, between Europe and Asia NaN \n", + "36 NaN NaN \n", + "37 NaN NaN \n", + "38 partially recognized state/not a UN member NaN \n", + "39 Located in the Caucasus, between Europe and Asia NaN \n", + "40 NaN NaN \n", + "41 NaN NaN \n", + "42 NaN NaN \n", + "43 NaN NaN \n", + "44 NaN NaN \n", + "45 NaN NaN \n", + "46 partially recognized state/non-member observer... 
NaN \n", + "47 NaN NaN \n", + "48 NaN NaN \n", + "49 NaN NaN \n", + "50 NaN NaN \n", + "51 NaN NaN " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracted 7 wikitables\n" + ] + } + ], + "source": [ + "# extract several tables from wikipedia from a single page\n", + "from pandas.io.html import read_html\n", + "page = 'https://en.wikipedia.org/wiki/New_York_City'\n", + "\n", + "wikitables = read_html(page, index_col=0, attrs={\"class\":\"wikitable\"}, header=None)\n", + "\n", + "print (\"Extracted {num} wikitables\".format(num=len(wikitables)))" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
New York City's five boroughsvteNew York City's five boroughsvte
JurisdictionJurisdictionPopulationGross Domestic ProductLand areaDensity
BoroughCountyEstimate (2018)[150]billions(US$)[151]per capita(US$)square milessquarekmpersons / sq. mipersons /km2
The BronxBronx143213242.6952920042.10109.043465313231
BrooklynKings258283091.5593460070.82183.423713714649
ManhattanNew York1628701600.24436090022.8359.137203327826
QueensQueens227890693.31039600108.53281.09214608354
Staten IslandRichmond47617914.5143030058.37151.1881123132
\n", + "
" + ], + "text/plain": [ + "New York City's five boroughsvte New York City's five boroughsvte \\\n", + "Jurisdiction Jurisdiction \n", + "Borough County \n", + "The Bronx Bronx \n", + "Brooklyn Kings \n", + "Manhattan New York \n", + "Queens Queens \n", + "Staten Island Richmond \n", + "\n", + "New York City's five boroughsvte \\\n", + "Jurisdiction Population Gross Domestic Product \n", + "Borough Estimate (2018)[150] billions(US$)[151] \n", + "The Bronx 1432132 42.695 \n", + "Brooklyn 2582830 91.559 \n", + "Manhattan 1628701 600.244 \n", + "Queens 2278906 93.310 \n", + "Staten Island 476179 14.514 \n", + "\n", + "New York City's five boroughsvte \\\n", + "Jurisdiction Land area \n", + "Borough per capita(US$) square miles squarekm \n", + "The Bronx 29200 42.10 109.04 \n", + "Brooklyn 34600 70.82 183.42 \n", + "Manhattan 360900 22.83 59.13 \n", + "Queens 39600 108.53 281.09 \n", + "Staten Island 30300 58.37 151.18 \n", + "\n", + "New York City's five boroughsvte \n", + "Jurisdiction Density \n", + "Borough persons / sq. mi persons /km2 \n", + "The Bronx 34653 13231 \n", + "Brooklyn 37137 14649 \n", + "Manhattan 72033 27826 \n", + "Queens 21460 8354 \n", + "Staten Island 8112 3132 " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[0].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(17, 13)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[1].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b]vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b]
MonthJanFebMarAprMayJunJulAugSepOctNovDecYear
Record high °F (°C)72(22)78(26)86(30)96(36)99(37)101(38)106(41)104(40)102(39)94(34)84(29)75(24)106(41)
Mean maximum °F (°C)59.6(15.3)60.7(15.9)71.5(21.9)83.0(28.3)88.0(31.1)92.3(33.5)95.4(35.2)93.7(34.3)88.5(31.4)78.8(26.0)71.3(21.8)62.2(16.8)97.0(36.1)
Average high °F (°C)38.3(3.5)41.6(5.3)49.7(9.8)61.2(16.2)70.8(21.6)79.3(26.3)84.1(28.9)82.6(28.1)75.2(24.0)63.8(17.7)53.8(12.1)43.0(6.1)62.0(16.7)
Daily mean °F (°C)32.6(0.3)35.3(1.8)42.5(5.8)53.0(11.7)62.4(16.9)71.5(21.9)76.5(24.7)75.2(24.0)68.0(20.0)56.9(13.8)47.7(8.7)37.5(3.1)55.0(12.8)
Average low °F (°C)26.9(−2.8)28.9(−1.7)35.2(1.8)44.8(7.1)54.0(12.2)63.6(17.6)68.8(20.4)67.8(19.9)60.8(16.0)50.0(10.0)41.6(5.3)32.0(0.0)48.0(8.9)
Mean minimum °F (°C)9.2(−12.7)12.8(−10.7)18.5(−7.5)32.3(0.2)43.5(6.4)52.9(11.6)60.3(15.7)58.8(14.9)48.6(9.2)38.0(3.3)27.7(−2.4)15.6(−9.1)7.0(−13.9)
Record low °F (°C)−6(−21)−15(−26)3(−16)12(−11)32(0)44(7)52(11)50(10)39(4)28(−2)5(−15)−13(−25)−15(−26)
Average precipitation inches (mm)3.65(93)3.09(78)4.36(111)4.50(114)4.19(106)4.41(112)4.60(117)4.44(113)4.28(109)4.40(112)4.02(102)4.00(102)49.94(1,268)
Average snowfall inches (cm)7.0(18)9.2(23)3.9(9.9)0.6(1.5)0(0)0(0)0(0)0(0)0(0)0(0)0.3(0.76)4.8(12)25.8(66)
Average precipitation days (≥ 0.01 in)10.49.210.911.511.111.210.49.58.78.99.610.6122.0
Average snowy days (≥ 0.1 in)4.02.81.80.30000000.22.311.4
Average relative humidity (%)61.560.258.555.362.765.264.266.067.865.664.664.163.0
Mean monthly sunshine hours162.7163.1212.5225.6256.6257.3268.2268.2219.3211.2151.0139.02534.7
Percent possible sunshine54555757575759635961514857
Average ultraviolet index2346788864215
Source #1: NOAA (relative humidity and sun 1961–1990)[196][210][192][211]Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...Source #1: NOAA (relative humidity and sun 196...
Source #2: Weather Atlas[212] See Geography of New York City for additional climate information from the outer boroughs.Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...Source #2: Weather Atlas[212] See Geography of...
\n", + "
" + ], + "text/plain": [ + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Jan \n", + "Record high °F (°C) 72(22) \n", + "Mean maximum °F (°C) 59.6(15.3) \n", + "Average high °F (°C) 38.3(3.5) \n", + "Daily mean °F (°C) 32.6(0.3) \n", + "Average low °F (°C) 26.9(−2.8) \n", + "Mean minimum °F (°C) 9.2(−12.7) \n", + "Record low °F (°C) −6(−21) \n", + "Average precipitation inches (mm) 3.65(93) \n", + "Average snowfall inches (cm) 7.0(18) \n", + "Average precipitation days (≥ 0.01 in) 10.4 \n", + "Average snowy days (≥ 0.1 in) 4.0 \n", + "Average relative humidity (%) 61.5 \n", + "Mean monthly sunshine hours 162.7 \n", + "Percent possible sunshine 54 \n", + "Average ultraviolet index 2 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Feb \n", + "Record high °F (°C) 78(26) \n", + "Mean maximum °F (°C) 60.7(15.9) \n", + "Average high °F (°C) 41.6(5.3) \n", + "Daily mean °F (°C) 35.3(1.8) \n", + "Average low °F (°C) 28.9(−1.7) \n", + "Mean minimum °F (°C) 12.8(−10.7) \n", + "Record low °F (°C) −15(−26) \n", + "Average precipitation inches (mm) 3.09(78) \n", + "Average snowfall inches (cm) 9.2(23) \n", + "Average precipitation days (≥ 0.01 in) 9.2 \n", + "Average snowy days (≥ 0.1 in) 2.8 \n", + "Average relative humidity (%) 60.2 \n", + "Mean monthly sunshine hours 163.1 \n", + "Percent possible sunshine 55 \n", + "Average ultraviolet index 3 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Mar \n", + "Record high °F (°C) 86(30) \n", + "Mean maximum °F (°C) 71.5(21.9) \n", + "Average high °F (°C) 49.7(9.8) \n", + "Daily mean °F (°C) 42.5(5.8) \n", + "Average low °F (°C) 35.2(1.8) \n", + "Mean minimum °F (°C) 18.5(−7.5) \n", + "Record low °F (°C) 3(−16) \n", + "Average precipitation inches (mm) 4.36(111) \n", + "Average snowfall inches (cm) 3.9(9.9) \n", + "Average precipitation days (≥ 0.01 in) 10.9 \n", + "Average snowy days (≥ 0.1 in) 1.8 \n", + "Average relative humidity (%) 58.5 \n", + "Mean monthly sunshine hours 212.5 \n", + "Percent possible sunshine 57 \n", + "Average ultraviolet index 4 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Apr \n", + "Record high °F (°C) 96(36) \n", + "Mean maximum °F (°C) 83.0(28.3) \n", + "Average high °F (°C) 61.2(16.2) \n", + "Daily mean °F (°C) 53.0(11.7) \n", + "Average low °F (°C) 44.8(7.1) \n", + "Mean minimum °F (°C) 32.3(0.2) \n", + "Record low °F (°C) 12(−11) \n", + "Average precipitation inches (mm) 4.50(114) \n", + "Average snowfall inches (cm) 0.6(1.5) \n", + "Average precipitation days (≥ 0.01 in) 11.5 \n", + "Average snowy days (≥ 0.1 in) 0.3 \n", + "Average relative humidity (%) 55.3 \n", + "Mean monthly sunshine hours 225.6 \n", + "Percent possible sunshine 57 \n", + "Average ultraviolet index 6 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month May \n", + "Record high °F (°C) 99(37) \n", + "Mean maximum °F (°C) 88.0(31.1) \n", + "Average high °F (°C) 70.8(21.6) \n", + "Daily mean °F (°C) 62.4(16.9) \n", + "Average low °F (°C) 54.0(12.2) \n", + "Mean minimum °F (°C) 43.5(6.4) \n", + "Record low °F (°C) 32(0) \n", + "Average precipitation inches (mm) 4.19(106) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 11.1 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 62.7 \n", + "Mean monthly sunshine hours 256.6 \n", + "Percent possible sunshine 57 \n", + "Average ultraviolet index 7 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Jun \n", + "Record high °F (°C) 101(38) \n", + "Mean maximum °F (°C) 92.3(33.5) \n", + "Average high °F (°C) 79.3(26.3) \n", + "Daily mean °F (°C) 71.5(21.9) \n", + "Average low °F (°C) 63.6(17.6) \n", + "Mean minimum °F (°C) 52.9(11.6) \n", + "Record low °F (°C) 44(7) \n", + "Average precipitation inches (mm) 4.41(112) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 11.2 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 65.2 \n", + "Mean monthly sunshine hours 257.3 \n", + "Percent possible sunshine 57 \n", + "Average ultraviolet index 8 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Jul \n", + "Record high °F (°C) 106(41) \n", + "Mean maximum °F (°C) 95.4(35.2) \n", + "Average high °F (°C) 84.1(28.9) \n", + "Daily mean °F (°C) 76.5(24.7) \n", + "Average low °F (°C) 68.8(20.4) \n", + "Mean minimum °F (°C) 60.3(15.7) \n", + "Record low °F (°C) 52(11) \n", + "Average precipitation inches (mm) 4.60(117) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 10.4 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 64.2 \n", + "Mean monthly sunshine hours 268.2 \n", + "Percent possible sunshine 59 \n", + "Average ultraviolet index 8 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Aug \n", + "Record high °F (°C) 104(40) \n", + "Mean maximum °F (°C) 93.7(34.3) \n", + "Average high °F (°C) 82.6(28.1) \n", + "Daily mean °F (°C) 75.2(24.0) \n", + "Average low °F (°C) 67.8(19.9) \n", + "Mean minimum °F (°C) 58.8(14.9) \n", + "Record low °F (°C) 50(10) \n", + "Average precipitation inches (mm) 4.44(113) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 9.5 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 66.0 \n", + "Mean monthly sunshine hours 268.2 \n", + "Percent possible sunshine 63 \n", + "Average ultraviolet index 8 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Sep \n", + "Record high °F (°C) 102(39) \n", + "Mean maximum °F (°C) 88.5(31.4) \n", + "Average high °F (°C) 75.2(24.0) \n", + "Daily mean °F (°C) 68.0(20.0) \n", + "Average low °F (°C) 60.8(16.0) \n", + "Mean minimum °F (°C) 48.6(9.2) \n", + "Record low °F (°C) 39(4) \n", + "Average precipitation inches (mm) 4.28(109) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 8.7 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 67.8 \n", + "Mean monthly sunshine hours 219.3 \n", + "Percent possible sunshine 59 \n", + "Average ultraviolet index 6 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Oct \n", + "Record high °F (°C) 94(34) \n", + "Mean maximum °F (°C) 78.8(26.0) \n", + "Average high °F (°C) 63.8(17.7) \n", + "Daily mean °F (°C) 56.9(13.8) \n", + "Average low °F (°C) 50.0(10.0) \n", + "Mean minimum °F (°C) 38.0(3.3) \n", + "Record low °F (°C) 28(−2) \n", + "Average precipitation inches (mm) 4.40(112) \n", + "Average snowfall inches (cm) 0(0) \n", + "Average precipitation days (≥ 0.01 in) 8.9 \n", + "Average snowy days (≥ 0.1 in) 0 \n", + "Average relative humidity (%) 65.6 \n", + "Mean monthly sunshine hours 211.2 \n", + "Percent possible sunshine 61 \n", + "Average ultraviolet index 4 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Nov \n", + "Record high °F (°C) 84(29) \n", + "Mean maximum °F (°C) 71.3(21.8) \n", + "Average high °F (°C) 53.8(12.1) \n", + "Daily mean °F (°C) 47.7(8.7) \n", + "Average low °F (°C) 41.6(5.3) \n", + "Mean minimum °F (°C) 27.7(−2.4) \n", + "Record low °F (°C) 5(−15) \n", + "Average precipitation inches (mm) 4.02(102) \n", + "Average snowfall inches (cm) 0.3(0.76) \n", + "Average precipitation days (≥ 0.01 in) 9.6 \n", + "Average snowy days (≥ 0.1 in) 0.2 \n", + "Average relative humidity (%) 64.6 \n", + "Mean monthly sunshine hours 151.0 \n", + "Percent possible sunshine 51 \n", + "Average ultraviolet index 2 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \\\n", + "Month Dec \n", + "Record high °F (°C) 75(24) \n", + "Mean maximum °F (°C) 62.2(16.8) \n", + "Average high °F (°C) 43.0(6.1) \n", + "Daily mean °F (°C) 37.5(3.1) \n", + "Average low °F (°C) 32.0(0.0) \n", + "Mean minimum °F (°C) 15.6(−9.1) \n", + "Record low °F (°C) −13(−25) \n", + "Average precipitation inches (mm) 4.00(102) \n", + "Average snowfall inches (cm) 4.8(12) \n", + "Average precipitation days (≥ 0.01 in) 10.6 \n", + "Average snowy days (≥ 0.1 in) 2.3 \n", + "Average relative humidity (%) 64.1 \n", + "Mean monthly sunshine hours 139.0 \n", + "Percent possible sunshine 48 \n", + "Average ultraviolet index 1 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... 
\n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... \n", + "\n", + "vteClimate data for New York (Belvedere Castle, Central Park), 1981–2010 normals,[a] extremes 1869–present[b] \n", + "Month Year \n", + "Record high °F (°C) 106(41) \n", + "Mean maximum °F (°C) 97.0(36.1) \n", + "Average high °F (°C) 62.0(16.7) \n", + "Daily mean °F (°C) 55.0(12.8) \n", + "Average low °F (°C) 48.0(8.9) \n", + "Mean minimum °F (°C) 7.0(−13.9) \n", + "Record low °F (°C) −15(−26) \n", + "Average precipitation inches (mm) 49.94(1,268) \n", + "Average snowfall inches (cm) 25.8(66) \n", + "Average precipitation days (≥ 0.01 in) 122.0 \n", + "Average snowy days (≥ 0.1 in) 11.4 \n", + "Average relative humidity (%) 63.0 \n", + "Mean monthly sunshine hours 2534.7 \n", + "Percent possible sunshine 57 \n", + "Average ultraviolet index 5 \n", + "Source #1: NOAA (relative humidity and sun 1961... Source #1: NOAA (relative humidity and sun 196... \n", + "Source #2: Weather Atlas[212] See Geography of ... Source #2: Weather Atlas[212] See Geography of... " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
New York City's five boroughsvte | New York City's five boroughsvte
Jurisdiction | Jurisdiction | Population | Gross Domestic Product | Land area | Density
Borough | County | Estimate (2018)[150] | billions(US$)[151] | per capita(US$) | square miles | squarekm | persons / sq. mi | persons /km2
The Bronx | Bronx | 1432132 | 42.695 | 29200 | 42.10 | 109.04 | 34653 | 13231
Brooklyn | Kings | 2582830 | 91.559 | 34600 | 70.82 | 183.42 | 37137 | 14649
Manhattan | New York | 1628701 | 600.244 | 360900 | 22.83 | 59.13 | 72033 | 27826
Queens | Queens | 2278906 | 93.310 | 39600 | 108.53 | 281.09 | 21460 | 8354
Staten Island | Richmond | 476179 | 14.514 | 30300 | 58.37 | 151.18 | 8112 | 3132
\n", + "
" + ], + "text/plain": [ + "New York City's five boroughsvte New York City's five boroughsvte \\\n", + "Jurisdiction Jurisdiction \n", + "Borough County \n", + "The Bronx Bronx \n", + "Brooklyn Kings \n", + "Manhattan New York \n", + "Queens Queens \n", + "Staten Island Richmond \n", + "\n", + "New York City's five boroughsvte \\\n", + "Jurisdiction Population Gross Domestic Product \n", + "Borough Estimate (2018)[150] billions(US$)[151] \n", + "The Bronx 1432132 42.695 \n", + "Brooklyn 2582830 91.559 \n", + "Manhattan 1628701 600.244 \n", + "Queens 2278906 93.310 \n", + "Staten Island 476179 14.514 \n", + "\n", + "New York City's five boroughsvte \\\n", + "Jurisdiction Land area \n", + "Borough per capita(US$) square miles squarekm \n", + "The Bronx 29200 42.10 109.04 \n", + "Brooklyn 34600 70.82 183.42 \n", + "Manhattan 360900 22.83 59.13 \n", + "Queens 39600 108.53 281.09 \n", + "Staten Island 30300 58.37 151.18 \n", + "\n", + "New York City's five boroughsvte \n", + "Jurisdiction Density \n", + "Borough persons / sq. 
mi persons /km2 \n", + "The Bronx 34653 13231 \n", + "Brooklyn 37137 14649 \n", + "Manhattan 72033 27826 \n", + "Queens 21460 8354 \n", + "Staten Island 8112 3132 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[0].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracted 1 wikitables\n" + ] + } + ], + "source": [ + "# change the index table\n", + "from pandas.io.html import read_html\n", + "page = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'\n", + "\n", + "wikitables = read_html(page, index_col=1, attrs={\"class\":\"wikitable\"})\n", + "\n", + "print (\"Extracted {num} wikitables\".format(num=len(wikitables)))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
0 | 2 | 3 | 4 | 5 | 6
1
Country or area | Rank | UN continentalregion[2] | UN statisticalregion[2] | Population(1 July 2016)[3] | Population(1 July 2017)[3] | Change
World | — | — | — | 7466964280 | 7550262101 | +1.1%
China[a] | 1 | Asia | Eastern Asia | 1403500365 | 1409517397 | +0.4%
India | 2 | Asia | Southern Asia | 1324171354 | 1339180127 | +1.1%
United States | 3 | Americas | Northern America | 322179605 | 324459463 | +0.7%
\n", + "
" + ], + "text/plain": [ + " 0 2 3 \\\n", + "1 \n", + "Country or area Rank UN continentalregion[2] UN statisticalregion[2] \n", + "World — — — \n", + "China[a] 1 Asia Eastern Asia \n", + "India 2 Asia Southern Asia \n", + "United States 3 Americas Northern America \n", + "\n", + " 4 5 \\\n", + "1 \n", + "Country or area Population(1 July 2016)[3] Population(1 July 2017)[3] \n", + "World 7466964280 7550262101 \n", + "China[a] 1403500365 1409517397 \n", + "India 1324171354 1339180127 \n", + "United States 322179605 324459463 \n", + "\n", + " 6 \n", + "1 \n", + "Country or area Change \n", + "World +1.1% \n", + "China[a] +0.4% \n", + "India +1.1% \n", + "United States +0.7% " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[0].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracted 1 wikitables\n" + ] + } + ], + "source": [ + "# works with different languages ( option encoding is available if needed)\n", + "from pandas.io.html import read_html\n", + "page = 'https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E5%9B%BD%E5%AE%B6%E5%92%8C%E5%9C%B0%E5%8C%BA%E4%BA%BA%E5%8F%A3%E6%8E%92%E5%90%8D%E5%88%97%E8%A1%A8'\n", + "\n", + "wikitables = read_html(page, index_col=0, attrs={\"class\":\"wikitable\"})\n", + "\n", + "print (\"Extracted {num} wikitables\".format(num=len(wikitables)))" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
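1 | 2 | 3 | 4 | 5 | 6
0
排名 | 国家或者地区 | 大洲[2] | 統計地區[2] | 人口(2016年7月1日)[3] | 人口(2017年7月1日)[3] | 变化率
— | 世界 | — | — | 7466964280 | 7550262101 | +1.1%
1 | 中华人民共和国[a] | 亚洲 | 东亚 | 1403500365 | 1409517397 | +0.4%
2 | 印度 | 亚洲 | 南亚 | 1324171354 | 1339180127 | +1.1%
3 | 美國 | 美洲 | 北美 | 322179605 | 324459463 | +0.7%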
\n", + "
" + ], + "text/plain": [ + " 1 2 3 4 5 6\n", + "0 \n", + "排名 国家或者地区 大洲[2] 統計地區[2] 人口(2016年7月1日)[3] 人口(2017年7月1日)[3] 变化率\n", + "— 世界 — — 7466964280 7550262101 +1.1%\n", + "1 中华人民共和国[a] 亚洲 东亚 1403500365 1409517397 +0.4%\n", + "2 印度 亚洲 南亚 1324171354 1339180127 +1.1%\n", + "3 美國 美洲 北美 322179605 324459463 +0.7%" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "wikitables[0].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Read wiki Infoboxes" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracted 1 infoboxes\n", + "Extracted 4 wikitables\n" + ] + } + ], + "source": [ + "from pandas.io.html import read_html\n", + "page = 'https://en.wikipedia.org/wiki/University_of_California,_Berkeley'\n", + "infoboxes = read_html(page, index_col=0, attrs={\"class\":\"infobox\"})\n", + "wikitables = read_html(page, index_col=0, attrs={\"class\":\"wikitable\"})\n", + "\n", + "print (\"Extracted {num} infoboxes\".format(num=len(infoboxes)))\n", + "print (\"Extracted {num} wikitables\".format(num=len(wikitables)))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
1
0
University rankings | NaN
National | NaN
ARWU[106] | 4.0
Forbes[107] | 14.0
U.S. News &amp; World Report[108] | 22.0
Washington Monthly[109] | 7.0
Global | NaN
ARWU[110] | 5.0
QS[111] | 27.0
Times[112] | 15.0
U.S. News &amp; World Report[113] | 4.0
\n", + "
" + ], + "text/plain": [ + " 1\n", + "0 \n", + "University rankings NaN\n", + "National NaN\n", + "ARWU[106] 4.0\n", + "Forbes[107] 14.0\n", + "U.S. News & World Report[108] 22.0\n", + "Washington Monthly[109] 7.0\n", + "Global NaN\n", + "ARWU[110] 5.0\n", + "QS[111] 27.0\n", + "Times[112] 15.0\n", + "U.S. News & World Report[113] 4.0" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "infoboxes[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Extracted 1 infoboxes\n", + "Extracted 1 wikitables\n" + ] + } + ], + "source": [ + "from pandas.io.html import read_html\n", + "page = 'https://en.wikipedia.org/wiki/Lisbon'\n", + "infoboxes = read_html(page, index_col=0, attrs={\"class\":\"infobox geography vcard\"})\n", + "wikitables = read_html(page, index_col=0, attrs={\"class\":\"wikitable\"})\n", + "\n", + "print (\"Extracted {num} infoboxes\".format(num=len(infoboxes)))\n", + "print (\"Extracted {num} wikitables\".format(num=len(wikitables)))" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
1 | 2
0
Country | Portugal | NaN
NUTS II Region | Lisbon metropolitan area | NaN
NUTS III Subregion | Lisbon metropolitan area | NaN
District | Lisbon | NaN
Municipality | Lisbon | NaN
Settlement | Prior to Roman rule | NaN
City | c. 1256 | NaN
Civil parishes | (see text) | NaN
Government | NaN | NaN
• Type | LAU | NaN
\n", + "
" + ], + "text/plain": [ + " 1 2\n", + "0 \n", + "Country Portugal NaN\n", + "NUTS II Region Lisbon metropolitan area NaN\n", + "NUTS III Subregion Lisbon metropolitan area NaN\n", + "District Lisbon NaN\n", + "Municipality Lisbon NaN\n", + "Settlement Prior to Roman rule NaN\n", + "City c. 1256 NaN\n", + "Civil parishes (see text) NaN\n", + "Government NaN NaN\n", + "• Type LAU NaN" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "infoboxes[0][10:20]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scrape non wiki tables" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
1 | 2 | 3 | 4 | 5 | 6 | 7
0
NaN | Player ID | Player Name | Total (Overall) | NaN | Highest Paying Game | Total (Game) | % of Total
1.0 | KuroKy | Kuro Takhasomi | $4,136,926.95 | | | Dota 2 | $4,135,203.61 | 99.96%
2.0 | N0tail | Johan Sundstein | $3,742,055.59 | | | Dota 2 | $3,733,903.98 | 99.78%
3.0 | Miracle- | Amer Al-Barkawi | $3,701,337.28 | | | Dota 2 | $3,701,337.28 | 100.00%
4.0 | MinD_ContRoL | Ivan Ivanov | $3,492,411.76 | | | Dota 2 | $3,492,411.76 | 100.00%
5.0 | Matumbaman | Lasse Urpalainen | $3,476,116.04 | | | Dota 2 | $3,476,116.04 | 100.00%
6.0 | JerAx | Jesse Vainikka | $3,313,463.82 | | | Dota 2 | $3,313,463.82 | 100.00%
7.0 | SumaiL | Sumail Hassan | $3,305,914.94 | | | Dota 2 | $3,305,914.94 | 100.00%
8.0 | GH | Maroun Merhej | $3,095,344.84 | | | Dota 2 | $3,095,344.84 | 100.00%
9.0 | UNiVeRsE | Saahil Arora | $3,035,737.67 | | | Dota 2 | $3,035,737.67 | 100.00%
\n", + "
" + ], + "text/plain": [ + " 1 2 3 4 \\\n", + "0 \n", + "NaN Player ID Player Name Total (Overall) NaN \n", + " 1.0 KuroKy Kuro Takhasomi $4,136,926.95 | \n", + " 2.0 N0tail Johan Sundstein $3,742,055.59 | \n", + " 3.0 Miracle- Amer Al-Barkawi $3,701,337.28 | \n", + " 4.0 MinD_ContRoL Ivan Ivanov $3,492,411.76 | \n", + " 5.0 Matumbaman Lasse Urpalainen $3,476,116.04 | \n", + " 6.0 JerAx Jesse Vainikka $3,313,463.82 | \n", + " 7.0 SumaiL Sumail Hassan $3,305,914.94 | \n", + " 8.0 GH Maroun Merhej $3,095,344.84 | \n", + " 9.0 UNiVeRsE Saahil Arora $3,035,737.67 | \n", + "\n", + " 5 6 7 \n", + "0 \n", + "NaN Highest Paying Game Total (Game) % of Total \n", + " 1.0 Dota 2 $4,135,203.61 99.96% \n", + " 2.0 Dota 2 $3,733,903.98 99.78% \n", + " 3.0 Dota 2 $3,701,337.28 100.00% \n", + " 4.0 Dota 2 $3,492,411.76 100.00% \n", + " 5.0 Dota 2 $3,476,116.04 100.00% \n", + " 6.0 Dota 2 $3,313,463.82 100.00% \n", + " 7.0 Dota 2 $3,305,914.94 100.00% \n", + " 8.0 Dota 2 $3,095,344.84 100.00% \n", + " 9.0 Dota 2 $3,035,737.67 100.00% " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from pandas.io.html import read_html\n", + "page = 'https://www.esportsearnings.com/players'\n", + "infoboxes = read_html(page, index_col=0, attrs={\"class\":\"detail_list_table\"})\n", + "\n", + "infoboxes[0].head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Convert html tables to csv/excel" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "from pandas.io.html import read_html\n", + "page = 'https://www.esportsearnings.com/players'\n", + "infoboxes = read_html(page, index_col=0, attrs={\"class\":\"detail_list_table\"})\n", + "\n", + "file_name = './my_file.csv'\n", + "infoboxes[0].to_csv(file_name, sep='\\t')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "./my_file.csv\r\n", + "./csv/movie_metadata.csv\r\n" + ] + } + ], + "source": [ + "!find . -type f -name \"*.csv\" " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Web Scraping Wikipedia Tables using BeautifulSoup and Python\n", + "\n", + "source: https://github.com/stewync/Web-Scraping-Wiki-tables-using-BeautifulSoup-and-Python/blob/master/Scraping%2BWiki%2Btable%2Busing%2BPython%2Band%2BBeautifulSoup.ipynb" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "website_url = requests.get('https://en.wikipedia.org/wiki/List_of_UFC_events').text" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "from bs4 import BeautifulSoup\n", + "soup = BeautifulSoup(website_url,'lxml')" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "My_table = soup.find('table',{'class':'wikitable'})" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "links = My_table.findAll('a')" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "events = []\n", + "for link in links:\n", + " events.append(link.get('title')) " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
events
0 | UFC on ESPN 4
1 | TBA
2 | None
3 | UFC on ESPN+ 11
4 | TBA
\n", + "
" + ], + "text/plain": [ + " events\n", + "0 UFC on ESPN 4\n", + "1 TBA\n", + "2 None\n", + "3 UFC on ESPN+ 11\n", + "4 TBA" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "df = pd.DataFrame()\n", + "df['events'] = events\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Other" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### wiki-table-scrape\n", + "https://github.com/rocheio/wiki-table-scrape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scraping Wikipedia Tables with Python\n", + "https://roche.io/2016/05/scrape-wikipedia-with-python" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/What_is_the_usage_of_*_asterisk_in_Python.ipynb b/notebooks/What_is_the_usage_of_*_asterisk_in_Python.ipynb new file mode 100644 index 0000000..e7c1c50 --- /dev/null +++ b/notebooks/What_is_the_usage_of_*_asterisk_in_Python.ipynb @@ -0,0 +1,385 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# What is the usage of * - asterisk in Python\n", + "\n", + "* For multiplication and power operations.\n", + "* Extending collections\n", + "* Unpacking\n", + "* positional arguments and keyword arguments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## For multiplication and power operations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "30" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "5 * 6" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "2 ** 2" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax (, line 1)", + "traceback": [ + "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m 2 *** 2\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" + ], + "output_type": "error" + } + ], + "source": [ + "2 *** 2" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'aaaaa'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'a' * 5" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'ffffff'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "'fff' * 2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extending collections" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "[0] * 20 " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": 
{ + "text/plain": [ + "[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "[0, 1 , 2] * 5" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[[0, 1, 2], [3], [0, 1, 2], [3]]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "[[0, 1 , 2], [3]] * 2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Unpacking" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 3, 5, 7, 9]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "odds = [1, 3, 5, 7, 9]\n", + "*x, = odds\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 3, 5, 7]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "*x,y = odds\n", + "x" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "x, *y, z = odds" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[3, 5, 7]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(1, 3, 5, 7, 9)\n", + "([1, 3, 
5, 7, 9],)\n" + ] + } + ], + "source": [ + "odds = [1, 3, 5, 7, 9]\n", + "\n", + "def sum_all(*numbers):\n", + " print(numbers)\n", + "\n", + "sum_all(*odds)\n", + "\n", + "sum_all(odds)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## positional arguments and keyword arguments" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('x', 'y', 'z', 'w', 'v')\n" + ] + } + ], + "source": [ + "def print_all(*args):\n", + " print(args) \n", + "print_all('x', 'y', 'z', 'w', 'v')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'x': 'x', 'y': 'y', 'z': 'z', 'w': 'w', 'v': 'v'}\n" + ] + } + ], + "source": [ + "def print_all(**kwargs):\n", + " print(kwargs)\n", + "print_all(x='x', y='y', z='z', w='w', v='v')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/csv/data.csv.zip b/notebooks/csv/data.csv.zip new file mode 100644 index 0000000..1acfcc2 Binary files /dev/null and b/notebooks/csv/data.csv.zip differ diff --git a/notebooks/csv/data_201901.csv b/notebooks/csv/data_201901.csv new file mode 100644 index 0000000..ec916f4 --- /dev/null +++ b/notebooks/csv/data_201901.csv @@ -0,0 +1,3 @@ +col1,col2,col3 +A,B,1 +AA,BB,2 \ No newline at end of file diff --git a/notebooks/csv/data_201902.csv b/notebooks/csv/data_201902.csv new file 
mode 100644 index 0000000..223cfe2 --- /dev/null +++ b/notebooks/csv/data_201902.csv @@ -0,0 +1,3 @@ +col1,col2,col3 +C,D,3 +CC,DD,4 \ No newline at end of file diff --git a/notebooks/csv/data_202001.csv b/notebooks/csv/data_202001.csv new file mode 100644 index 0000000..52bdb1d --- /dev/null +++ b/notebooks/csv/data_202001.csv @@ -0,0 +1,3 @@ +col1,col2,col3,col4 +E,F,5,e5 +EE,FF,6,ee6 \ No newline at end of file diff --git a/notebooks/csv/data_202002.csv b/notebooks/csv/data_202002.csv new file mode 100644 index 0000000..56194e0 --- /dev/null +++ b/notebooks/csv/data_202002.csv @@ -0,0 +1,3 @@ +col1,col2,col3,col5 +H,J,7,77 +HH,JJ,8,88 \ No newline at end of file diff --git a/notebooks/csv/excel/example.xlsx b/notebooks/csv/excel/example.xlsx new file mode 100644 index 0000000..d58d686 Binary files /dev/null and b/notebooks/csv/excel/example.xlsx differ diff --git a/notebooks/pandas/20._Pandas_-_value_counts_-_multiple_columns%2C_all_columns_and_bad_data.ipynb b/notebooks/pandas/20._Pandas_-_value_counts_-_multiple_columns%2C_all_columns_and_bad_data.ipynb new file mode 100644 index 0000000..48f8734 --- /dev/null +++ b/notebooks/pandas/20._Pandas_-_value_counts_-_multiple_columns%2C_all_columns_and_bad_data.ipynb @@ -0,0 +1,1372 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 20. Pandas - value_counts - multiple columns, all columns and bad data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df_movie = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df_resp = pd.read_csv(\"../csv/other_text_responses.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Pandas apply value_counts on multiple columns at once" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',\n", + " 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',\n", + " 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',\n", + " 'movie_title', 'num_voted_users', 'cast_total_facebook_likes',\n", + " 'actor_3_name', 'facenumber_in_poster', 'plot_keywords',\n", + " 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',\n", + " 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',\n", + " 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],\n", + " dtype='object')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_movie.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
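color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes
0.00 | NaN | NaN | NaN | NaN | 907.0 | 89.0 | NaN | 26.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 55.0 | NaN | NaN | 2181.0
1.00 | NaN | NaN | 43.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 51.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1.18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN
1.20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN
1.33 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 68.0 | NaN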
\n", + "

5 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0.00 NaN NaN NaN NaN \n", + "1.00 NaN NaN 43.0 NaN \n", + "1.18 NaN NaN NaN NaN \n", + "1.20 NaN NaN NaN NaN \n", + "1.33 NaN NaN NaN NaN \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0.00 907.0 89.0 NaN \n", + "1.00 NaN NaN NaN \n", + "1.18 NaN NaN NaN \n", + "1.20 NaN NaN NaN \n", + "1.33 NaN NaN NaN \n", + "\n", + " actor_1_facebook_likes gross genres ... num_user_for_reviews \\\n", + "0.00 26.0 NaN NaN ... NaN \n", + "1.00 NaN NaN NaN ... 51.0 \n", + "1.18 NaN NaN NaN ... NaN \n", + "1.20 NaN NaN NaN ... NaN \n", + "1.33 NaN NaN NaN ... NaN \n", + "\n", + " language country content_rating budget title_year \\\n", + "0.00 NaN NaN NaN NaN NaN \n", + "1.00 NaN NaN NaN NaN NaN \n", + "1.18 NaN NaN NaN NaN NaN \n", + "1.20 NaN NaN NaN NaN NaN \n", + "1.33 NaN NaN NaN NaN NaN \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \n", + "0.00 55.0 NaN NaN 2181.0 \n", + "1.00 NaN NaN NaN NaN \n", + "1.18 NaN NaN 1.0 NaN \n", + "1.20 NaN NaN 1.0 NaN \n", + "1.33 NaN NaN 68.0 NaN \n", + "\n", + "[5 rows x 28 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_movie.apply(pd.Series.value_counts).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(37410, 28)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_movie.apply(pd.Series.value_counts).shape" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color content_rating\n", + " Black and White 209.0 NaN\n", + "Approved NaN 55.0\n", + "Color 4815.0 NaN\n", + "G NaN 112.0\n", + "GP NaN 6.0\n", + "M NaN 5.0\n", + "NC-17 NaN 7.0\n", + "Not Rated NaN 116.0\n", + "PG NaN 701.0\n", + "PG-13 NaN 1461.0\n", + "Passed NaN 9.0\n", + "R NaN 2118.0\n", + "TV-14 NaN 30.0\n", + "TV-G NaN 10.0\n", + "TV-MA NaN 20.0\n", + "TV-PG NaN 13.0\n", + "TV-Y NaN 1.0\n", + "TV-Y7 NaN 1.0\n", + "Unrated NaN 62.0\n", + "X NaN 13.0" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_movie[['color', 'content_rating']].apply(pd.Series.value_counts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Pandas apply value_counts on all columns" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q12_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "LinkedIn 39\n", + "Medium 38\n", + "Linkedin 16\n", + "Coursera 16\n", + "Books 14\n", + "medium 11\n", + "Facebook 11\n", + "linkedin 9\n", + "books 8\n", + "Data Science Central 7\n", + "Name: Q12_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q13_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "mlcourse.ai 35\n", + "NPTEL 23\n", + "Youtube 22\n", + "Simplilearn 18\n", + "Pluralsight 17\n", + "Stepik 14\n", + "Data Science Academy 12\n", + "youtube 12\n", + "Springboard 11\n", + "Books 10\n", + "Name: Q13_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"----------------------------------------Q14_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Python 89\n", + "python 45\n", + "None 36\n", + "Matlab 28\n", + "none 22\n", + "R 13\n", + "SQL 13\n", + "MATLAB 11\n", + "matlab 11\n", + "Python 9\n", + "Name: Q14_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q14_Part_1_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Excel 865\n", + "Microsoft Excel 392\n", + "excel 263\n", + "MS Excel 67\n", + "Google Sheets 61\n", + "Google sheets 44\n", + "Microsoft excel 38\n", + "Excel 33\n", + "microsoft excel 27\n", + "EXCEL 25\n", + "Name: Q14_Part_1_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q14_Part_2_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "SAS 129\n", + "SPSS 116\n", + "R 60\n", + "spss 34\n", + "Spss 25\n", + "Python 21\n", + "Sas 18\n", + "Stata 15\n", + "python 14\n", + "R, Python 11\n", + "Name: Q14_Part_2_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q14_Part_3_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Tableau 260\n", + "Power BI 71\n", + "tableau 51\n", + "PowerBI 23\n", + "Salesforce 19\n", + "Tableau 16\n", + "Qlik 10\n", + "Power Bi 9\n", + "Spotfire 8\n", + "SAP 6\n", + "Name: Q14_Part_3_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"----------------------------------------Q14_Part_4_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "JupyterLab 943\n", + "Jupyter 602\n", + "RStudio 516\n", + "Python 301\n", + "Jupyter Notebook 275\n", + "Rstudio 225\n", + "Jupyterlab 184\n", + "jupyter 183\n", + "Jupyter notebook 170\n", + "python 163\n", + "Name: Q14_Part_4_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q14_Part_5_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "AWS 203\n", + "GCP 134\n", + "Azure 87\n", + "aws 40\n", + "Aws 26\n", + "Google Colab 25\n", + "Databricks 17\n", + "gcp 15\n", + "Gcp 12\n", + "Colab 11\n", + "Name: Q14_Part_5_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q16_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Eclipse 47\n", + "IntelliJ 17\n", + "Intellij 14\n", + "eclipse 13\n", + "SAS 11\n", + "Google Colab 11\n", + "IntelliJ IDEA 10\n", + "Colab 9\n", + "Anaconda 9\n", + "Xcode 8\n", + "Name: Q16_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q17_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Databricks 43\n", + "databricks 12\n", + "Github 12\n", + "Anaconda 8\n", + "Domino Data Lab 5\n", + "Zeppelin 5\n", + "Domino 4\n", + "Jupyter Notebook 4\n", + "Anaconda 3\n", + "Jupyter 3\n", + "Name: Q17_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ 
+ "----------------------------------------Q18_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "C# 198\n", + "Scala 100\n", + "SAS 79\n", + "Julia 46\n", + "PHP 43\n", + "VBA 30\n", + "Ruby 27\n", + "c# 27\n", + "Go 27\n", + "Swift 20\n", + "Name: Q18_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q19_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Julia 22\n", + "Scala 9\n", + "C# 6\n", + "SAS 4\n", + "Swift 4\n", + "Octave 4\n", + "scala 2\n", + "Rust 2\n", + "mathematica 2\n", + "julia 2\n", + "Name: Q19_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q20_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Tableau 31\n", + "Excel 13\n", + "MATLAB 12\n", + "Power BI 10\n", + "Pandas 8\n", + "tableau 6\n", + "pandas 5\n", + "PowerBI 5\n", + "Dash 4\n", + "Spotfire 4\n", + "Name: Q20_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q21_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "FPGA 7\n", + "Laptop 4\n", + "Fpga 2\n", + "Planning to use GPUs 1\n", + "spark databricks on AWS 1\n", + "Edge neurocomputing chips like Intel's NCS. 1\n", + "but i wana to use gpu 1\n", + "Intel NCS2 1\n", + "Parallel comp with MPI 1\n", + "paperspace uses GPU's I believe... 
I prefer them figuring that part out 1\n", + "Name: Q21_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q24_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "SVM 32\n", + "Support Vector Machines 11\n", + "KNN 7\n", + "Clustering 6\n", + "svm 4\n", + "Support Vector Machine 4\n", + "SVM, KNN 4\n", + "SVMs 4\n", + "Support vector machine 3\n", + "KMeans 2\n", + "Name: Q24_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q25_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "DataRobot 12\n", + "catalyst 7\n", + "Catalyst 6\n", + "Microsoft ML 2\n", + "Datarobot 2\n", + "sklearn 2\n", + "fastai 2\n", + "Weka 1\n", + "Stepwise regression 1\n", + "SAS, ScykitLearn 1\n", + "Name: Q25_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q26_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Handwriting Recognize tools 1\n", + "Custom Built Tools 1\n", + "everytime daily constantly, never stoping, never ceasing to fail with measurable risk and damage to never stop knowing the perfect limit of our capacity and perfection. Cancerous attitude towards ourselves but victorious for our comformists. 
1\n", + "Anomaly detection on videos 1\n", + "SSD-Keras 1\n", + "Fast.ai 1\n", + "Wavenet 1\n", + "openCV 1\n", + "GIS 1\n", + "text processing 1\n", + "Name: Q26_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q27_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "SpaCy 2\n", + "Text mining by R and Python libraries only 1\n", + "Retrofitting 1\n", + "OWL 1\n", + "Flair 1\n", + "Making your own 1\n", + "Stopwords, Lemmatization, TF-IDF, BoW 1\n", + "fastai 1\n", + "svm 1\n", + "FastAI 1\n", + "Name: Q27_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q28_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Catboost 24\n", + "CatBoost 14\n", + "catboost 13\n", + "h2o 11\n", + "H2O 10\n", + "MATLAB 9\n", + "Chainer 6\n", + "MXNet 4\n", + "Catalyst 4\n", + "Caffe 4\n", + "Name: Q28_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q29_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Digital Ocean 5\n", + "DigitalOcean 4\n", + "Tencent Cloud 3\n", + "DataRobot 2\n", + "Google Colab 2\n", + "Private cloud 2\n", + "paperspace 2\n", + "SAS Cloud 2\n", + "Databricks 2\n", + "OVH 2\n", + "Name: Q29_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q2_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + 
"non-binary 4\n", + "Attack Helicopter 2\n", + "bionicle 2\n", + "T-rex shaped meteor made out of cheese 1\n", + "Pharoah 1\n", + "queer 1\n", + "Lvl 129 Dust Devil 1\n", + "What is your gender? - Prefer to self-describe - Text 1\n", + "genderfluid helicopter 1\n", + "none 1\n", + "Name: Q2_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q30_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "IBM Cloud 7\n", + "Databricks 6\n", + "AWS EMR 4\n", + "AWS SageMaker 4\n", + "AWS S3 3\n", + "Azure Databricks 3\n", + "IBM Watson 3\n", + "Inhouse 2\n", + "AWS Fargate 2\n", + "OpenShift 2\n", + "Name: Q30_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q31_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Snowflake 13\n", + "SAS 6\n", + "Spark 4\n", + "Cloudera 4\n", + "Hadoop 4\n", + "DataRobot 4\n", + "IBM Cloud Pak for Data 3\n", + "Splunk 3\n", + "Apache Spark 3\n", + "Tableau 2\n", + "Name: Q31_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q32_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "DataRobot 15\n", + "Knime 11\n", + "KNIME 8\n", + "IBM Watson Studio 7\n", + "Alteryx 4\n", + "MATLAB 4\n", + "IBM Cloud 3\n", + "H2O 3\n", + "Datarobot 3\n", + "IBM Watson 3\n", + "Name: Q32_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"----------------------------------------Q33_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "IBM AutoAI 6\n", + "Prevision.io 4\n", + "H2O AutoML 3\n", + "H20 AutoML 2\n", + "H2O.ai AutoML 2\n", + "SAS 2\n", + "prevision.io 2\n", + "Which automated machine learning tools (or partial AutoML tools) do you use on a regular basis? (Select all that apply) - Other - Text 1\n", + "Watson ML 1\n", + "custom 1\n", + "Name: Q33_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q34_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Snowflake 18\n", + "DB2 17\n", + "MongoDB 15\n", + "Teradata 12\n", + "IBM DB2 7\n", + "Mongo 6\n", + "SAP HANA 6\n", + "MariaDB 5\n", + "SAS 5\n", + "Hadoop 4\n", + "Name: Q34_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q5_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "Professor 43\n", + "Machine Learning Engineer 32\n", + "Teacher 19\n", + "Consultant 19\n", + "Lecturer 14\n", + "CTO 13\n", + "CEO 13\n", + "Engineer 12\n", + "Mechanical Engineer 11\n", + "Solution Architect 11\n", + "Name: Q5_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "----------------------------------------Q9_OTHER_TEXT---------------------------------------- - " + ] + }, + { + "data": { + "text/plain": [ + "i'm professor 1\n", + "I am a Technical Project Manager and I involve with customer product owner and my teams to analyse and suggest business decisions. 
1\n", + "Architecture 1\n", + "Model methodology development 1\n", + "Produce data driven research 1\n", + "human-centered data science research 1\n", + "\"> 1\n", + "Analyze business systems and processes; recommend solutions. 1\n", + "Support a product that provides analytics & ML libraries 1\n", + "Conceptualize workflows and design experiments 1\n", + "Name: Q9_OTHER_TEXT, dtype: int64" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# List top values per question\n", + "for col in df_resp.columns:\n", + "    print('-' * 40 + col + '-' * 40, end=' - ')\n", + "    display(df_resp[col].value_counts().head(10))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# value_counts() fails on unhashable values (dicts, lists): TypeError: unhashable type: 'dict'\n", + "df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}], 'c':[[1],[2],[3]]})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Pandas apply value_counts on column with bad data" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Power BI', 'PowerBI', 'Power Bi']" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import difflib\n", + "difflib.get_close_matches('Power BI', ['Power BI', 'tableau', 'PowerBI', 'Power Bi','Salesforce', 'Tableau ', 'Qlik', 'Power bi'], n=3, cutoff=0.6)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "import difflib\n", + "\n", + "correct_values = {}\n", + "# Iterate from the rarest to the most frequent value, so each variant ends up mapped to its most common close match\n", + "words = df_resp.Q14_Part_3_TEXT.value_counts(ascending=True).index\n", + "\n", + "for keyword in words:\n", + "    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)\n", + "    for x in similar:\n", + "        correct_values[x] = keyword\n", + "\n", + "df_resp[\"corr\"] = df_resp[\"Q14_Part_3_TEXT\"].map(correct_values)" + ] + }, +
{ + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Tableau 345\n", + "Power BI 137\n", + "Salesforce 43\n", + "Qlik 27\n", + "Spotfire 17\n", + " ... \n", + "tableau which is very fast and easy to analyse 1\n", + "We use Tableau to analyse through histograms,bargraphs and many more tools in Tableau 1\n", + "Izenda, Excel, XtraReports 1\n", + "ssrs 1\n", + "XLcubed 1\n", + "Name: corr, Length: 179, dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_resp[\"corr\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Tableau 260\n", + "Power BI 71\n", + "tableau 51\n", + "PowerBI 23\n", + "Salesforce 19\n", + " ... \n", + "Domo 1\n", + "MySQL Client, Tableau 1\n", + "Datastudio 1\n", + "Abinitio 1\n", + "XLcubed 1\n", + "Name: Q14_Part_3_TEXT, Length: 339, dtype: int64" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_resp[\"Q14_Part_3_TEXT\"].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/21. pandas-dataframe-sampling-rows-or-columns.ipynb b/notebooks/pandas/21. pandas-dataframe-sampling-rows-or-columns.ipynb new file mode 100644 index 0000000..35e9f2b --- /dev/null +++ b/notebooks/pandas/21. 
pandas-dataframe-sampling-rows-or-columns.ipynb @@ -0,0 +1,3681 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 21. Pandas - Random Sample of a subset of a dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Random sampling of rows, columns from DataFrame with sample()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "2525 Color Joel Schumacher 106.0 98.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "2525 541.0 71.0 David Murray \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "2525 214.0 1569918.0 Biography|Crime|Drama|Thriller ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "2525 113.0 English Ireland R 17000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "2525 2003.0 96.0 6.9 2.35 \n", + "\n", + " movie_facebook_likes \n", + "2525 0 \n", + "\n", + "[1 rows x 28 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Default behavior of sample()\n", + "df.sample()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "535 Color Dennis Dugan 179.0 102.0 \n", + "2987 Color Fred Schepisi 61.0 109.0 \n", + "1475 Color David Koepp 248.0 91.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "535 221.0 4000.0 Adam Sandler \n", + "2987 40.0 794.0 Ray Winstone \n", + "1475 192.0 346.0 Dania Ramirez \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "535 12000.0 162001186.0 Comedy ... \n", + "2987 5000.0 2326407.0 Drama ... \n", + "1475 23000.0 20275446.0 Action|Crime|Thriller ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "535 311.0 English USA PG-13 80000000.0 \n", + "2987 99.0 English UK R 12000000.0 \n", + "1475 178.0 English USA PG-13 35000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "535 2010.0 11000.0 6.0 1.85 \n", + "2987 2001.0 1000.0 7.0 2.35 \n", + "1475 2012.0 1000.0 6.5 2.35 \n", + "\n", + " movie_facebook_likes \n", + "535 12000 \n", + "2987 305 \n", + "1475 20000 \n", + "\n", + "[3 rows x 28 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# return n rows\n", + "df.sample(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color num_critic_for_reviews title_year\n", + "0 Color 723.0 2009.0\n", + "1 Color 302.0 2007.0\n", + "2 Color 602.0 2015.0\n", + "3 Color 813.0 2012.0\n", + "4 NaN NaN NaN" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# columns\n", + "df.sample(3, axis=1).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "2389 Color Terry Gilliam 156.0 118.0 \n", + "4233 Black and White Tay Garnett 7.0 119.0 \n", + "4737 Color Greg Harrison 46.0 86.0 \n", + "3717 Color Mike Flanagan 336.0 104.0 \n", + "1854 Color Garry Marshall 200.0 113.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "2389 0.0 551.0 Michael Jeter \n", + "4233 10.0 275.0 Greer Garson \n", + "4737 7.0 17.0 Ari Gold \n", + "3717 59.0 202.0 Rory Cochrane \n", + "1854 0.0 307.0 Common \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "2389 40000.0 10562387.0 Adventure|Comedy|Drama ... \n", + "4233 509.0 NaN Drama ... \n", + "4737 328.0 1114943.0 Drama|Music ... \n", + "3717 972.0 27689474.0 Horror|Mystery ... \n", + "1854 22000.0 54540525.0 Comedy|Romance ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "2389 648.0 English USA R 18500000.0 \n", + "4233 29.0 English USA Passed 2160000.0 \n", + "4737 74.0 English USA R 500000.0 \n", + "3717 339.0 English USA R 5000000.0 \n", + "1854 134.0 English USA PG-13 56000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "2389 1998.0 693.0 7.7 2.35 \n", + "4233 1945.0 284.0 7.5 1.37 \n", + "4737 2000.0 27.0 6.5 1.85 \n", + "3717 2013.0 407.0 6.5 2.35 \n", + "1854 2011.0 988.0 5.7 1.85 \n", + "\n", + " movie_facebook_likes \n", + "2389 15000 \n", + "4233 68 \n", + "4737 0 \n", + "3717 23000 \n", + "1854 20000 \n", + "\n", + "[5 rows x 28 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The fraction of rows and columns: frac\n", + "df.sample(frac=0.001)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "4232 Color Hayley Cloake 11.0 90.0 \n", + "4877 Color Tom Seidman 4.0 98.0 \n", + "3399 Color Mike Leigh 248.0 129.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "4232 0.0 306.0 Kayla Ewell \n", + "4877 3.0 104.0 Derek Brandon \n", + "3399 608.0 386.0 Imelda Staunton \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "4232 676.0 NaN Thriller ... \n", + "4877 337.0 NaN Drama|Family ... \n", + "3399 1000.0 3205244.0 Comedy|Drama ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "4232 9.0 English USA R 2200000.0 \n", + "4877 10.0 English USA PG 250000.0 \n", + "3399 141.0 English UK PG-13 10000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "4232 2008.0 399.0 4.3 NaN \n", + "4877 2010.0 168.0 6.2 NaN \n", + "3399 2010.0 579.0 7.3 2.35 \n", + "\n", + " movie_facebook_likes \n", + "4232 77 \n", + "4877 0 \n", + "3399 0 \n", + "\n", + "[3 rows x 28 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sample with seed\n", + "df.sample(n=3, random_state=5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. p.random.choice" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
893ColorGabriele Muccino202.0123.0125.0835.0Rosario Dawson10000.069951824.0Drama|Romance...599.0EnglishUSAPG-1355000000.02008.03000.07.72.3526000
818ColorBrian Levant57.090.032.0809.0Mark Addy1000.035231365.0Comedy|Family|Romance|Sci-Fi...85.0EnglishUSAPG60000000.02000.0891.03.61.85500
460ColorSimon West139.0123.0165.0744.0Monica Potter12000.0101087161.0Action|Crime|Thriller...339.0EnglishUSAR75000000.01997.0878.06.82.350
772ColorClint Eastwood306.0134.016000.0204.0Morgan Freeman13000.037479778.0Biography|Drama|History|Sport...259.0EnglishUSAPG-1360000000.02009.011000.07.42.3523000
269ColorLen Wiseman354.0129.0235.0297.0Jonathan Sadowski13000.0134520804.0Action|Adventure|Thriller...782.0EnglishUSAPG-13110000000.02007.0300.07.22.350
\n", + "

5 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "893 Color Gabriele Muccino 202.0 123.0 \n", + "818 Color Brian Levant 57.0 90.0 \n", + "460 Color Simon West 139.0 123.0 \n", + "772 Color Clint Eastwood 306.0 134.0 \n", + "269 Color Len Wiseman 354.0 129.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "893 125.0 835.0 Rosario Dawson \n", + "818 32.0 809.0 Mark Addy \n", + "460 165.0 744.0 Monica Potter \n", + "772 16000.0 204.0 Morgan Freeman \n", + "269 235.0 297.0 Jonathan Sadowski \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "893 10000.0 69951824.0 Drama|Romance ... \n", + "818 1000.0 35231365.0 Comedy|Family|Romance|Sci-Fi ... \n", + "460 12000.0 101087161.0 Action|Crime|Thriller ... \n", + "772 13000.0 37479778.0 Biography|Drama|History|Sport ... \n", + "269 13000.0 134520804.0 Action|Adventure|Thriller ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "893 599.0 English USA PG-13 55000000.0 \n", + "818 85.0 English USA PG 60000000.0 \n", + "460 339.0 English USA R 75000000.0 \n", + "772 259.0 English USA PG-13 60000000.0 \n", + "269 782.0 English USA PG-13 110000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "893 2008.0 3000.0 7.7 2.35 \n", + "818 2000.0 891.0 3.6 1.85 \n", + "460 1997.0 878.0 6.8 2.35 \n", + "772 2009.0 11000.0 7.4 2.35 \n", + "269 2007.0 300.0 7.2 2.35 \n", + "\n", + " movie_facebook_likes \n", + "893 26000 \n", + "818 500 \n", + "460 0 \n", + "772 23000 \n", + "269 0 \n", + "\n", + "[5 rows x 28 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# nu,py choice for DataFrame sampling\n", + "import numpy as np\n", + "chosen_idx = np.random.choice(1000, replace=False, size=5)\n", + "df.iloc[chosen_idx]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Random sample of rows based on column values" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Color - " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
2693ColorAlbert Brooks97.097.0745.0745.0Bradley Whitford12000.011614236.0Comedy...140.0EnglishUSAPG-1315000000.01999.0821.05.61.37251
1613ColorAlexander Payne217.0125.0729.0322.0June Squibb442.065010106.0Comedy|Drama...612.0EnglishUSAR30000000.02002.0344.07.21.850
698ColorLawrence Kasdan40.0212.0759.0812.0Catherine O'Hara2000.025052000.0Adventure|Biography|Crime|Drama|Western...145.0EnglishUSAPG-1363000000.01994.0925.06.62.350
\n", + "

3 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "2693 Color Albert Brooks 97.0 97.0 \n", + "1613 Color Alexander Payne 217.0 125.0 \n", + "698 Color Lawrence Kasdan 40.0 212.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "2693 745.0 745.0 Bradley Whitford \n", + "1613 729.0 322.0 June Squibb \n", + "698 759.0 812.0 Catherine O'Hara \n", + "\n", + " actor_1_facebook_likes gross \\\n", + "2693 12000.0 11614236.0 \n", + "1613 442.0 65010106.0 \n", + "698 2000.0 25052000.0 \n", + "\n", + " genres ... num_user_for_reviews \\\n", + "2693 Comedy ... 140.0 \n", + "1613 Comedy|Drama ... 612.0 \n", + "698 Adventure|Biography|Crime|Drama|Western ... 145.0 \n", + "\n", + " language country content_rating budget title_year \\\n", + "2693 English USA PG-13 15000000.0 1999.0 \n", + "1613 English USA R 30000000.0 2002.0 \n", + "698 English USA PG-13 63000000.0 1994.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \n", + "2693 821.0 5.6 1.37 251 \n", + "1613 344.0 7.2 1.85 0 \n", + "698 925.0 6.6 2.35 0 \n", + "\n", + "[3 rows x 28 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Black and White - " + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
4786Black and WhiteLloyd Bacon65.089.024.045.0Dick Powell610.02300000.0Comedy|Musical|Romance...97.0EnglishUSAUnrated439000.01933.0105.07.71.37439
3983Black and WhiteJohn Schlesinger88.0113.0154.077.0Barnard Hughes183.0NaNDrama...334.0EnglishUSAX3600000.01969.089.07.91.850
479Black and WhiteNaN31.025.0NaN474.0Agnes Moorehead1000.0NaNComedy|Family|Fantasy...71.0EnglishUSATV-GNaNNaN960.07.64.000
\n", + "

3 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "4786 Black and White Lloyd Bacon 65.0 89.0 \n", + "3983 Black and White John Schlesinger 88.0 113.0 \n", + "479 Black and White NaN 31.0 25.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "4786 24.0 45.0 Dick Powell \n", + "3983 154.0 77.0 Barnard Hughes \n", + "479 NaN 474.0 Agnes Moorehead \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "4786 610.0 2300000.0 Comedy|Musical|Romance ... \n", + "3983 183.0 NaN Drama ... \n", + "479 1000.0 NaN Comedy|Family|Fantasy ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "4786 97.0 English USA Unrated 439000.0 \n", + "3983 334.0 English USA X 3600000.0 \n", + "479 71.0 English USA TV-G NaN \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "4786 1933.0 105.0 7.7 1.37 \n", + "3983 1969.0 89.0 7.9 1.85 \n", + "479 NaN 960.0 7.6 4.00 \n", + "\n", + " movie_facebook_likes \n", + "4786 439 \n", + "3983 0 \n", + "479 0 \n", + "\n", + "[3 rows x 28 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# conditional DataFrame sampling few values - separate display \n", + "col = 'color'\n", + "for typ in list(df[col].dropna().unique()):\n", + " print(typ, end=' - ')\n", + " display(df[df[col] == typ].sample(3))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['PG-13', 'PG', 'G', 'R', 'TV-14', 'TV-PG', 'TV-MA', 'TV-G', 'Not Rated', 'Unrated', 'Approved', 'TV-Y', 'NC-17', 'X', 'TV-Y7', 'GP', 'Passed', 'M']\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
3758ColorBrian Dannelly121.092.012.0797.0Patrick Fugit3000.08786715.0Comedy|Drama...324.0EnglishUSAPG-135000000.02004.0835.06.91.850
64ColorAndrew Adamson284.0150.080.082.0Kiran Shah1000.0291709845.0Adventure|Family|Fantasy...1463.0EnglishUSAPG180000000.02005.0190.06.92.350
4725ColorJoe Camp5.086.024.0142.0Peter Breck407.039552600.0Adventure|Family|Romance...36.0EnglishUSAG500000.01974.0189.06.11.85816
2955ColorFranklin J. Schaffner55.0125.076.0801.0Anne Meara1000.0NaNDrama|Thriller...123.0EnglishUKR12000000.01978.0837.07.01.850
3509ColorNaN11.060.0NaN652.0Ashley Scott10000.0NaNAction|Drama|Mystery|Sci-Fi...160.0EnglishUSATV-14NaNNaN794.07.41.330
4803ColorNaN11.022.0NaN6.0Ron Lynch59.0NaNAnimation|Comedy|Drama...82.0EnglishUSATV-PGNaNNaN11.08.21.33526
826ColorNaN46.030.0NaN479.0Kristin Davis962.0NaNComedy|Romance...238.0EnglishUSATV-MANaNNaN722.07.01.330
3880ColorKenny Ortega57.098.0197.0578.0Corbin Bleu755.0NaNComedy|Drama|Family|Music|Musical|Romance...726.0EnglishUSATV-G4200000.02006.0632.05.21.330
4328Black and WhiteOrson Welles90.092.00.018.0Everett Sloane1000.07927.0Crime|Drama|Film-Noir|Mystery|Thriller...175.0EnglishUSANot Rated2300000.01947.029.07.71.370
4997ColorDavid Gordon Green75.090.0234.015.0Eddie Rouse552.0241816.0Drama...76.0EnglishUSAUnrated42000.02000.061.07.52.35451
4497Black and WhiteWalter Lang7.083.09.051.0Nigel Bruce94.0NaNDrama|Family|Fantasy...25.0EnglishUSAApprovedNaN1940.062.06.51.37548
1265ColorNaN3.030.0NaN12.0Melissa Altro51.0NaNAnimation|Comedy|Family...43.0EnglishCanadaTV-YNaNNaN21.07.41.33301
5025ColorJohn Waters73.0108.00.0105.0Mink Stole462.0180483.0Comedy|Crime|Horror...183.0EnglishUSANC-1710000.01972.0143.06.11.370
3559ColorBrian De Palma121.0104.00.0517.0David Margulies754.031899000.0Mystery|Romance|Thriller...201.0EnglishUSAX6500000.01980.0567.07.12.350
1972ColorNaN7.030.0NaN265.0Jennifer Hale971.0NaNAction|Animation|Comedy|Family|Fantasy|Sci-Fi...60.0EnglishUSATV-Y7NaNNaN918.07.24.00581
4529ColorDouglas Trumbull87.089.0136.042.0Ron Rifkin844.0NaNDrama|Sci-Fi...199.0EnglishUSAGP1000000.01972.0184.06.71.850
4812Black and WhiteHarry Beaumont36.0100.04.04.0Bessie Love77.02808000.0Musical|Romance...71.0EnglishUSAPassed379000.01929.028.06.31.37167
3584ColorGeorge Roy Hill130.0110.0131.0399.0Ted Cassidy640.0102308900.0Biography|Crime|Drama|Western...309.0EnglishUSAM6000000.01969.0566.08.12.350
\n", + "

18 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews \\\n", + "3758 Color Brian Dannelly 121.0 \n", + "64 Color Andrew Adamson 284.0 \n", + "4725 Color Joe Camp 5.0 \n", + "2955 Color Franklin J. Schaffner 55.0 \n", + "3509 Color NaN 11.0 \n", + "4803 Color NaN 11.0 \n", + "826 Color NaN 46.0 \n", + "3880 Color Kenny Ortega 57.0 \n", + "4328 Black and White Orson Welles 90.0 \n", + "4997 Color David Gordon Green 75.0 \n", + "4497 Black and White Walter Lang 7.0 \n", + "1265 Color NaN 3.0 \n", + "5025 Color John Waters 73.0 \n", + "3559 Color Brian De Palma 121.0 \n", + "1972 Color NaN 7.0 \n", + "4529 Color Douglas Trumbull 87.0 \n", + "4812 Black and White Harry Beaumont 36.0 \n", + "3584 Color George Roy Hill 130.0 \n", + "\n", + " duration director_facebook_likes actor_3_facebook_likes \\\n", + "3758 92.0 12.0 797.0 \n", + "64 150.0 80.0 82.0 \n", + "4725 86.0 24.0 142.0 \n", + "2955 125.0 76.0 801.0 \n", + "3509 60.0 NaN 652.0 \n", + "4803 22.0 NaN 6.0 \n", + "826 30.0 NaN 479.0 \n", + "3880 98.0 197.0 578.0 \n", + "4328 92.0 0.0 18.0 \n", + "4997 90.0 234.0 15.0 \n", + "4497 83.0 9.0 51.0 \n", + "1265 30.0 NaN 12.0 \n", + "5025 108.0 0.0 105.0 \n", + "3559 104.0 0.0 517.0 \n", + "1972 30.0 NaN 265.0 \n", + "4529 89.0 136.0 42.0 \n", + "4812 100.0 4.0 4.0 \n", + "3584 110.0 131.0 399.0 \n", + "\n", + " actor_2_name actor_1_facebook_likes gross \\\n", + "3758 Patrick Fugit 3000.0 8786715.0 \n", + "64 Kiran Shah 1000.0 291709845.0 \n", + "4725 Peter Breck 407.0 39552600.0 \n", + "2955 Anne Meara 1000.0 NaN \n", + "3509 Ashley Scott 10000.0 NaN \n", + "4803 Ron Lynch 59.0 NaN \n", + "826 Kristin Davis 962.0 NaN \n", + "3880 Corbin Bleu 755.0 NaN \n", + "4328 Everett Sloane 1000.0 7927.0 \n", + "4997 Eddie Rouse 552.0 241816.0 \n", + "4497 Nigel Bruce 94.0 NaN \n", + "1265 Melissa Altro 51.0 NaN \n", + "5025 Mink Stole 462.0 180483.0 \n", + "3559 David Margulies 754.0 31899000.0 \n", + "1972 Jennifer Hale 971.0 NaN \n", + "4529 Ron Rifkin 844.0 
NaN \n", + "4812 Bessie Love 77.0 2808000.0 \n", + "3584 Ted Cassidy 640.0 102308900.0 \n", + "\n", + " genres ... num_user_for_reviews \\\n", + "3758 Comedy|Drama ... 324.0 \n", + "64 Adventure|Family|Fantasy ... 1463.0 \n", + "4725 Adventure|Family|Romance ... 36.0 \n", + "2955 Drama|Thriller ... 123.0 \n", + "3509 Action|Drama|Mystery|Sci-Fi ... 160.0 \n", + "4803 Animation|Comedy|Drama ... 82.0 \n", + "826 Comedy|Romance ... 238.0 \n", + "3880 Comedy|Drama|Family|Music|Musical|Romance ... 726.0 \n", + "4328 Crime|Drama|Film-Noir|Mystery|Thriller ... 175.0 \n", + "4997 Drama ... 76.0 \n", + "4497 Drama|Family|Fantasy ... 25.0 \n", + "1265 Animation|Comedy|Family ... 43.0 \n", + "5025 Comedy|Crime|Horror ... 183.0 \n", + "3559 Mystery|Romance|Thriller ... 201.0 \n", + "1972 Action|Animation|Comedy|Family|Fantasy|Sci-Fi ... 60.0 \n", + "4529 Drama|Sci-Fi ... 199.0 \n", + "4812 Musical|Romance ... 71.0 \n", + "3584 Biography|Crime|Drama|Western ... 309.0 \n", + "\n", + " language country content_rating budget title_year \\\n", + "3758 English USA PG-13 5000000.0 2004.0 \n", + "64 English USA PG 180000000.0 2005.0 \n", + "4725 English USA G 500000.0 1974.0 \n", + "2955 English UK R 12000000.0 1978.0 \n", + "3509 English USA TV-14 NaN NaN \n", + "4803 English USA TV-PG NaN NaN \n", + "826 English USA TV-MA NaN NaN \n", + "3880 English USA TV-G 4200000.0 2006.0 \n", + "4328 English USA Not Rated 2300000.0 1947.0 \n", + "4997 English USA Unrated 42000.0 2000.0 \n", + "4497 English USA Approved NaN 1940.0 \n", + "1265 English Canada TV-Y NaN NaN \n", + "5025 English USA NC-17 10000.0 1972.0 \n", + "3559 English USA X 6500000.0 1980.0 \n", + "1972 English USA TV-Y7 NaN NaN \n", + "4529 English USA GP 1000000.0 1972.0 \n", + "4812 English USA Passed 379000.0 1929.0 \n", + "3584 English USA M 6000000.0 1969.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \n", + "3758 835.0 6.9 1.85 0 \n", + "64 190.0 6.9 2.35 0 \n", + "4725 189.0 6.1 
1.85 816 \n", + "2955 837.0 7.0 1.85 0 \n", + "3509 794.0 7.4 1.33 0 \n", + "4803 11.0 8.2 1.33 526 \n", + "826 722.0 7.0 1.33 0 \n", + "3880 632.0 5.2 1.33 0 \n", + "4328 29.0 7.7 1.37 0 \n", + "4997 61.0 7.5 2.35 451 \n", + "4497 62.0 6.5 1.37 548 \n", + "1265 21.0 7.4 1.33 301 \n", + "5025 143.0 6.1 1.37 0 \n", + "3559 567.0 7.1 2.35 0 \n", + "1972 918.0 7.2 4.00 581 \n", + "4529 184.0 6.7 1.85 0 \n", + "4812 28.0 6.3 1.37 167 \n", + "3584 566.0 8.1 2.35 0 \n", + "\n", + "[18 rows x 28 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# conditional DataFrame sampling many values - grouped display \n", + "col = 'content_rating'\n", + "sample = []\n", + "\n", + "variants = list(df[col].dropna().unique())\n", + "print(variants)\n", + "\n", + "for typ in variants:\n", + " sample.append(df[df[col] == typ].sample())\n", + "pd.concat(sample)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. DataFrame sampling with numpy and weights" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
2425Black and WhiteMartin Scorsese151.0121.017000.0356.0Cathy Moriarty22000.045250.0Biography|Drama|Sport...EnglishUSAR18000000.01980.0394.08.31.8501.0
1125Black and WhiteOliver Stone83.0212.00.0805.0Bob Hoskins12000.013560960.0Biography|Drama|History...EnglishUSAR50000000.01995.05000.07.12.359151.0
3539NaNRichard Rich2.045.024.029.0Kate Higgins122.0NaNAction|Adventure|Animation|Comedy|Drama|Family......NaNUSANaN7000000.02014.035.06.0NaN411.0
2944Black and WhiteMartin Campbell400.0144.0258.0834.0Tobias Menzies6000.0167007184.0Action|Adventure|Thriller...EnglishUKPG-13150000000.02006.01000.08.02.3501.0
4359Black and WhiteStanley Kubrick192.095.00.0277.0Slim Pickens654.0NaNComedy...EnglishUSAPG1800000.01964.0575.08.51.66180001.0
\n", + "

5 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "2425 Black and White Martin Scorsese 151.0 121.0 \n", + "1125 Black and White Oliver Stone 83.0 212.0 \n", + "3539 NaN Richard Rich 2.0 45.0 \n", + "2944 Black and White Martin Campbell 400.0 144.0 \n", + "4359 Black and White Stanley Kubrick 192.0 95.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "2425 17000.0 356.0 Cathy Moriarty \n", + "1125 0.0 805.0 Bob Hoskins \n", + "3539 24.0 29.0 Kate Higgins \n", + "2944 258.0 834.0 Tobias Menzies \n", + "4359 0.0 277.0 Slim Pickens \n", + "\n", + " actor_1_facebook_likes gross \\\n", + "2425 22000.0 45250.0 \n", + "1125 12000.0 13560960.0 \n", + "3539 122.0 NaN \n", + "2944 6000.0 167007184.0 \n", + "4359 654.0 NaN \n", + "\n", + " genres ... language country \\\n", + "2425 Biography|Drama|Sport ... English USA \n", + "1125 Biography|Drama|History ... English USA \n", + "3539 Action|Adventure|Animation|Comedy|Drama|Family... ... NaN USA \n", + "2944 Action|Adventure|Thriller ... English UK \n", + "4359 Comedy ... 
English USA \n", + "\n", + " content_rating budget title_year actor_2_facebook_likes \\\n", + "2425 R 18000000.0 1980.0 394.0 \n", + "1125 R 50000000.0 1995.0 5000.0 \n", + "3539 NaN 7000000.0 2014.0 35.0 \n", + "2944 PG-13 150000000.0 2006.0 1000.0 \n", + "4359 PG 1800000.0 1964.0 575.0 \n", + "\n", + " imdb_score aspect_ratio movie_facebook_likes weights \n", + "2425 8.3 1.85 0 1.0 \n", + "1125 7.1 2.35 915 1.0 \n", + "3539 6.0 NaN 41 1.0 \n", + "2944 8.0 2.35 0 1.0 \n", + "4359 8.5 1.66 18000 1.0 \n", + "\n", + "[5 rows x 29 columns]" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# excluding 'Color' values by applying weights 0 - Color and 1 - rest\n", + "df['weights'] = np.where(df['color'] == 'Color', .0, 1)\n", + "df.sample(frac=.001, weights='weights')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
3083ColorFloyd Mutrux5.099.011.0327.0Kelli Williams665.0125169.0Comedy|Drama...EnglishUSAR10500000.01994.0446.06.4NaN1191.0
3435ColorGeorge Tillman Jr.34.0115.088.0890.0Mekhi Phifer1000.043490057.0Comedy|Drama...EnglishUSAR7500000.01997.01000.06.91.855081.0
3940ColorRenny Harlin68.0102.0212.0195.0Lane Smith10000.0354704.0Crime|Drama|Horror|Thriller...EnglishUSAR1300000.01987.0633.05.91.853141.0
5006ColorDamir CaticNaN89.02.00.0Ron Gelner5.0NaNHorror...EnglishUSANot Rated60000.02013.00.05.4NaN481.0
1872ColorMichael Hoffman85.0118.097.0437.0Gerald McRaney775.026761283.0Drama|Romance...EnglishUSAPG-1326000000.02014.0523.06.72.35190001.0
\n", + "

5 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "3083 Color Floyd Mutrux 5.0 99.0 \n", + "3435 Color George Tillman Jr. 34.0 115.0 \n", + "3940 Color Renny Harlin 68.0 102.0 \n", + "5006 Color Damir Catic NaN 89.0 \n", + "1872 Color Michael Hoffman 85.0 118.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "3083 11.0 327.0 Kelli Williams \n", + "3435 88.0 890.0 Mekhi Phifer \n", + "3940 212.0 195.0 Lane Smith \n", + "5006 2.0 0.0 Ron Gelner \n", + "1872 97.0 437.0 Gerald McRaney \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "3083 665.0 125169.0 Comedy|Drama ... \n", + "3435 1000.0 43490057.0 Comedy|Drama ... \n", + "3940 10000.0 354704.0 Crime|Drama|Horror|Thriller ... \n", + "5006 5.0 NaN Horror ... \n", + "1872 775.0 26761283.0 Drama|Romance ... \n", + "\n", + " language country content_rating budget title_year \\\n", + "3083 English USA R 10500000.0 1994.0 \n", + "3435 English USA R 7500000.0 1997.0 \n", + "3940 English USA R 1300000.0 1987.0 \n", + "5006 English USA Not Rated 60000.0 2013.0 \n", + "1872 English USA PG-13 26000000.0 2014.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \\\n", + "3083 446.0 6.4 NaN 119 \n", + "3435 1000.0 6.9 1.85 508 \n", + "3940 633.0 5.9 1.85 314 \n", + "5006 0.0 5.4 NaN 48 \n", + "1872 523.0 6.7 2.35 19000 \n", + "\n", + " weights \n", + "3083 1.0 \n", + "3435 1.0 \n", + "3940 1.0 \n", + "5006 1.0 \n", + "1872 1.0 \n", + "\n", + "[5 rows x 29 columns]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Including only 'Color' values by applying weights 1 - Color and 0 - rest\n", + "df['weights'] = np.where(df['color'] == 'Color', 1, 0.0)\n", + "df.sample(frac=.001, weights='weights')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 1, 0, 
1, 1, 1, 1, 1, 1, 0])" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.where(df['color'] == 'Color', 1, 0)[270:280]" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Color', 'Color', ' Black and White', 'Color', 'Color', 'Color', 'Color', 'Color', 'Color', nan]\n" + ] + } + ], + "source": [ + "print(list(df['color'][270:280]))" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2153 PG\n", + "2238 PG-13\n", + "836 PG-13\n", + "4725 G\n", + "3205 PG\n", + "3388 PG-13\n", + "152 PG-13\n", + "2574 PG-13\n", + "2724 PG-13\n", + "1854 PG-13\n", + "Name: content_rating, dtype: object" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Including/excluding list of values - with equal probability\n", + "df['weights'] = np.where(df['content_rating'].isin(['PG-13', 'PG', 'G']), 1, 0)\n", + "df.sample(frac=.002, weights='weights')['content_rating']" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "PG-13 1461\n", + "PG 701\n", + "G 112\n", + "Name: content_rating, dtype: int64" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['content_rating'].isin(['PG-13', 'PG', 'G'])]['content_rating'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Pandas sample rows by group" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
color
Black and White2734Black and WhiteFritz Lang260.0145.0756.018.0Gustav Fröhlich136.026435.0Drama|Sci-Fi...GermanGermanyNot Rated6000000.01927.023.08.31.33120000
4741Black and WhiteMorgan J. Freeman17.086.0204.0474.0Heather Matarazzo659.0334041.0Crime|Drama|Romance...EnglishUSAR500000.01997.0529.06.51.85510
1979Black and WhiteNeil Jordan44.0133.0277.08000.0Liam Neeson25000.011030963.0Biography|Drama|Thriller|War...EnglishUKR28000000.01996.014000.07.11.8500
Color3436ColorStanley Tong62.089.07.036.0Anita Mui186.032333860.0Action|Comedy...CantoneseHong KongR7500000.01995.0147.06.72.3500
734ColorOliver Stone171.0156.00.01000.0Dennis Quaid14000.075530832.0Drama|Sport...EnglishUSAR55000000.01999.02000.06.82.3500
2336ColorAndrew Jarecki140.0101.046.0902.0Kirsten Dunst33000.0578382.0Crime|Drama|Mystery|Romance|Thriller...EnglishUSARNaN2010.04000.06.31.8500
\n", + "

6 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name \\\n", + "color \n", + " Black and White 2734 Black and White Fritz Lang \n", + " 4741 Black and White Morgan J. Freeman \n", + " 1979 Black and White Neil Jordan \n", + "Color 3436 Color Stanley Tong \n", + " 734 Color Oliver Stone \n", + " 2336 Color Andrew Jarecki \n", + "\n", + " num_critic_for_reviews duration \\\n", + "color \n", + " Black and White 2734 260.0 145.0 \n", + " 4741 17.0 86.0 \n", + " 1979 44.0 133.0 \n", + "Color 3436 62.0 89.0 \n", + " 734 171.0 156.0 \n", + " 2336 140.0 101.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes \\\n", + "color \n", + " Black and White 2734 756.0 18.0 \n", + " 4741 204.0 474.0 \n", + " 1979 277.0 8000.0 \n", + "Color 3436 7.0 36.0 \n", + " 734 0.0 1000.0 \n", + " 2336 46.0 902.0 \n", + "\n", + " actor_2_name actor_1_facebook_likes gross \\\n", + "color \n", + " Black and White 2734 Gustav Fröhlich 136.0 26435.0 \n", + " 4741 Heather Matarazzo 659.0 334041.0 \n", + " 1979 Liam Neeson 25000.0 11030963.0 \n", + "Color 3436 Anita Mui 186.0 32333860.0 \n", + " 734 Dennis Quaid 14000.0 75530832.0 \n", + " 2336 Kirsten Dunst 33000.0 578382.0 \n", + "\n", + " genres ... language \\\n", + "color ... \n", + " Black and White 2734 Drama|Sci-Fi ... German \n", + " 4741 Crime|Drama|Romance ... English \n", + " 1979 Biography|Drama|Thriller|War ... English \n", + "Color 3436 Action|Comedy ... Cantonese \n", + " 734 Drama|Sport ... English \n", + " 2336 Crime|Drama|Mystery|Romance|Thriller ... 
English \n", + "\n", + " country content_rating budget title_year \\\n", + "color \n", + " Black and White 2734 Germany Not Rated 6000000.0 1927.0 \n", + " 4741 USA R 500000.0 1997.0 \n", + " 1979 UK R 28000000.0 1996.0 \n", + "Color 3436 Hong Kong R 7500000.0 1995.0 \n", + " 734 USA R 55000000.0 1999.0 \n", + " 2336 USA R NaN 2010.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "color \n", + " Black and White 2734 23.0 8.3 1.33 \n", + " 4741 529.0 6.5 1.85 \n", + " 1979 14000.0 7.1 1.85 \n", + "Color 3436 147.0 6.7 2.35 \n", + " 734 2000.0 6.8 2.35 \n", + " 2336 4000.0 6.3 1.85 \n", + "\n", + " movie_facebook_likes weights \n", + "color \n", + " Black and White 2734 12000 0 \n", + " 4741 51 0 \n", + " 1979 0 0 \n", + "Color 3436 0 0 \n", + " 734 0 0 \n", + " 2336 0 0 \n", + "\n", + "[6 rows x 29 columns]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby('color').apply(lambda x: x.sample(n=3))" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
0Black and WhiteSteven Zaillian127.0128.0234.0581.0Anthony Hopkins14000.07221458.0Drama|Thriller...EnglishGermanyPG-1355000000.02006.012000.06.21.8501
1Black and WhiteRandy Moore143.090.013.0432.0Lee Armstrong977.0169719.0Fantasy|Horror...EnglishUSANot RatedNaN2013.0511.05.21.8500
2Black and WhiteTodd Haynes231.0135.0162.0228.0Heath Ledger23000.04001121.0Biography|Drama|Music...EnglishUSAR20000000.02007.013000.07.02.3500
3ColorMartin Brest94.0105.0102.0383.0Ronny Cox901.0234760500.0Action|Comedy|Crime...EnglishUSAR14000000.01984.0605.07.31.8500
4ColorDario Argento76.0120.0930.0433.0Adrienne Barbeau982.0349618.0Horror...EnglishItalyR9000000.01990.0602.06.11.853750
5ColorTyler Perry36.0113.00.0256.0Mary J. Blige607.051697449.0Comedy|Drama...EnglishUSAPG-1313000000.02009.0269.04.11.8510001
\n", + "

6 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Black and White Steven Zaillian 127.0 128.0 \n", + "1 Black and White Randy Moore 143.0 90.0 \n", + "2 Black and White Todd Haynes 231.0 135.0 \n", + "3 Color Martin Brest 94.0 105.0 \n", + "4 Color Dario Argento 76.0 120.0 \n", + "5 Color Tyler Perry 36.0 113.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 234.0 581.0 Anthony Hopkins \n", + "1 13.0 432.0 Lee Armstrong \n", + "2 162.0 228.0 Heath Ledger \n", + "3 102.0 383.0 Ronny Cox \n", + "4 930.0 433.0 Adrienne Barbeau \n", + "5 0.0 256.0 Mary J. Blige \n", + "\n", + " actor_1_facebook_likes gross genres ... language \\\n", + "0 14000.0 7221458.0 Drama|Thriller ... English \n", + "1 977.0 169719.0 Fantasy|Horror ... English \n", + "2 23000.0 4001121.0 Biography|Drama|Music ... English \n", + "3 901.0 234760500.0 Action|Comedy|Crime ... English \n", + "4 982.0 349618.0 Horror ... English \n", + "5 607.0 51697449.0 Comedy|Drama ... 
English \n", + "\n", + " country content_rating budget title_year actor_2_facebook_likes \\\n", + "0 Germany PG-13 55000000.0 2006.0 12000.0 \n", + "1 USA Not Rated NaN 2013.0 511.0 \n", + "2 USA R 20000000.0 2007.0 13000.0 \n", + "3 USA R 14000000.0 1984.0 605.0 \n", + "4 Italy R 9000000.0 1990.0 602.0 \n", + "5 USA PG-13 13000000.0 2009.0 269.0 \n", + "\n", + " imdb_score aspect_ratio movie_facebook_likes weights \n", + "0 6.2 1.85 0 1 \n", + "1 5.2 1.85 0 0 \n", + "2 7.0 2.35 0 0 \n", + "3 7.3 1.85 0 0 \n", + "4 6.1 1.85 375 0 \n", + "5 4.1 1.85 1000 1 \n", + "\n", + "[6 rows x 29 columns]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby('color').apply(lambda x: x.sample(n=3)).reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bonus: get first and last rows of DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-Fi...EnglishUSAPG-13237000000.02009.0936.07.91.78330001
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|Fantasy...EnglishUSAPG-13300000000.02007.05000.07.12.3501
\n", + "

2 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n", + "\n", + " language country content_rating budget title_year \\\n", + "0 English USA PG-13 237000000.0 2009.0 \n", + "1 English USA PG-13 300000000.0 2007.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \\\n", + "0 936.0 7.9 1.78 33000 \n", + "1 5000.0 7.1 2.35 0 \n", + "\n", + " weights \n", + "0 1 \n", + "1 1 \n", + "\n", + "[2 rows x 29 columns]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
5041ColorDaniel Hsia14.0100.00.0489.0Daniel Henney946.010443.0Comedy|Drama|Romance...EnglishUSAPG-13NaN2012.0719.06.32.356601
5042ColorJon Gunn43.090.016.016.0Brian Herzlinger86.085222.0Documentary...EnglishUSAPG1100.02004.023.06.61.854561
\n", + "

2 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "5041 Color Daniel Hsia 14.0 100.0 \n", + "5042 Color Jon Gunn 43.0 90.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "5041 0.0 489.0 Daniel Henney \n", + "5042 16.0 16.0 Brian Herzlinger \n", + "\n", + " actor_1_facebook_likes gross genres ... language \\\n", + "5041 946.0 10443.0 Comedy|Drama|Romance ... English \n", + "5042 86.0 85222.0 Documentary ... English \n", + "\n", + " country content_rating budget title_year actor_2_facebook_likes \\\n", + "5041 USA PG-13 NaN 2012.0 719.0 \n", + "5042 USA PG 1100.0 2004.0 23.0 \n", + "\n", + " imdb_score aspect_ratio movie_facebook_likes weights \n", + "5041 6.3 2.35 660 1 \n", + "5042 6.6 1.85 456 1 \n", + "\n", + "[2 rows x 29 columns]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.tail(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...languagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likesweights
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-Fi...EnglishUSAPG-13237000000.02009.0936.07.91.78330001
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|Fantasy...EnglishUSAPG-13300000000.02007.05000.07.12.3501
5041ColorDaniel Hsia14.0100.00.0489.0Daniel Henney946.010443.0Comedy|Drama|Romance...EnglishUSAPG-13NaN2012.0719.06.32.356601
5042ColorJon Gunn43.090.016.016.0Brian Herzlinger86.085222.0Documentary...EnglishUSAPG1100.02004.023.06.61.854561
\n", + "

4 rows × 29 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "5041 Color Daniel Hsia 14.0 100.0 \n", + "5042 Color Jon Gunn 43.0 90.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "5041 0.0 489.0 Daniel Henney \n", + "5042 16.0 16.0 Brian Herzlinger \n", + "\n", + " actor_1_facebook_likes gross genres \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy \n", + "5041 946.0 10443.0 Comedy|Drama|Romance \n", + "5042 86.0 85222.0 Documentary \n", + "\n", + " ... language country content_rating budget title_year \\\n", + "0 ... English USA PG-13 237000000.0 2009.0 \n", + "1 ... English USA PG-13 300000000.0 2007.0 \n", + "5041 ... English USA PG-13 NaN 2012.0 \n", + "5042 ... English USA PG 1100.0 2004.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \\\n", + "0 936.0 7.9 1.78 33000 \n", + "1 5000.0 7.1 2.35 0 \n", + "5041 719.0 6.3 2.35 660 \n", + "5042 23.0 6.6 1.85 456 \n", + "\n", + " weights \n", + "0 1 \n", + "1 1 \n", + "5041 1 \n", + "5042 1 \n", + "\n", + "[4 rows x 29 columns]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# combine head and tail variant 1\n", + "rows = 2\n", + "df.head(rows).append(df.tail(rows))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", 
+ "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/22.pandas-how-to-filter-results-of-value_counts.ipynb b/notebooks/pandas/22.pandas-how-to-filter-results-of-value_counts.ipynb new file mode 100644 index 0000000..4fb69be --- /dev/null +++ b/notebooks/pandas/22.pandas-how-to-filter-results-of-value_counts.ipynb @@ -0,0 +1,1019 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 22. Pandas: how to filter the results of value_counts" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-Fi...3054.0EnglishUSAPG-13237000000.02009.0936.07.91.7833000
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|Fantasy...1238.0EnglishUSAPG-13300000000.02007.05000.07.12.350
\n", + "

2 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "0 3054.0 English USA PG-13 237000000.0 \n", + "1 1238.0 English USA PG-13 300000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "0 2009.0 936.0 7.9 1.78 \n", + "1 2007.0 5000.0 7.1 2.35 \n", + "\n", + " movie_facebook_likes \n", + "0 33000 \n", + "1 0 \n", + "\n", + "[2 rows x 28 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sample of the data\n", + "df.head(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. How value counts works" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 English\n", + "1 English\n", + "2 English\n", + "3 English\n", + "4 NaN\n", + " ... 
\n", + "5038 English\n", + "5039 English\n", + "5040 English\n", + "5041 English\n", + "5042 English\n", + "Name: language, Length: 5043, dtype: object" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "col = 'language'\n", + "df[col]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "English 4704\n", + "French 73\n", + "Spanish 40\n", + "Hindi 28\n", + "Mandarin 26\n", + "German 19\n", + "Japanese 18\n", + "Russian 11\n", + "Cantonese 11\n", + "Italian 11\n", + "Portuguese 8\n", + "Korean 8\n", + "Arabic 5\n", + "Hebrew 5\n", + "Swedish 5\n", + "Danish 5\n", + "Persian 4\n", + "Norwegian 4\n", + "Polish 4\n", + "Dutch 4\n", + "Chinese 3\n", + "Thai 3\n", + "Icelandic 2\n", + "Dari 2\n", + "Zulu 2\n", + "None 2\n", + "Romanian 2\n", + "Aboriginal 2\n", + "Indonesian 2\n", + "Panjabi 1\n", + "Kazakh 1\n", + "Kannada 1\n", + "Aramaic 1\n", + "Urdu 1\n", + "Dzongkha 1\n", + "Czech 1\n", + "Tamil 1\n", + "Bosnian 1\n", + "Telugu 1\n", + "Hungarian 1\n", + "Filipino 1\n", + "Mongolian 1\n", + "Slovenian 1\n", + "Greek 1\n", + "Vietnamese 1\n", + "Maya 1\n", + "Swahili 1\n", + "Name: language, dtype: int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[col].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dtype('int64')" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[col].value_counts().dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([4704, 73, 40, 28, 26, 19, 18, 11, 11, 11, 8,\n", + " 8, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3,\n", + " 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1,\n", + " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", + " 1, 
1, 1])" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[col].value_counts().values" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['English', 'French', 'Spanish', 'Hindi', 'Mandarin', 'German',\n", + " 'Japanese', 'Russian', 'Cantonese', 'Italian', 'Portuguese', 'Korean',\n", + " 'Arabic', 'Hebrew', 'Swedish', 'Danish', 'Persian', 'Norwegian',\n", + " 'Polish', 'Dutch', 'Chinese', 'Thai', 'Icelandic', 'Dari', 'Zulu',\n", + " 'None', 'Romanian', 'Aboriginal', 'Indonesian', 'Panjabi', 'Kazakh',\n", + " 'Kannada', 'Aramaic', 'Urdu', 'Dzongkha', 'Czech', 'Tamil', 'Bosnian',\n", + " 'Telugu', 'Hungarian', 'Filipino', 'Mongolian', 'Slovenian', 'Greek',\n", + " 'Vietnamese', 'Maya', 'Swahili'],\n", + " dtype='object')" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[col].value_counts().index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Filter value_counts with isin" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2388 Chinese\n", + "2740 Thai\n", + "3022 Chinese\n", + "3311 Thai\n", + "3427 Chinese\n", + "3659 Thai\n", + "Name: language, dtype: object" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)].language" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 True\n", + "1 True\n", + "2 True\n", + "3 True\n", + "4 False\n", + " ... 
\n", + "5038 True\n", + "5039 True\n", + "5040 True\n", + "5041 True\n", + "5042 True\n", + "Name: language, Length: 5043, dtype: bool" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[col].isin(df[col].value_counts().index)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "English False\n", + "French False\n", + "Spanish False\n", + "Hindi False\n", + "Mandarin False\n", + "German False\n", + "Japanese False\n", + "Russian False\n", + "Cantonese False\n", + "Italian False\n", + "Portuguese False\n", + "Korean False\n", + "Arabic False\n", + "Hebrew False\n", + "Swedish False\n", + "Danish False\n", + "Persian False\n", + "Norwegian False\n", + "Polish False\n", + "Dutch False\n", + "Chinese True\n", + "Thai True\n", + "Icelandic False\n", + "Dari False\n", + "Zulu False\n", + "None False\n", + "Romanian False\n", + "Aboriginal False\n", + "Indonesian False\n", + "Panjabi False\n", + "Kazakh False\n", + "Kannada False\n", + "Aramaic False\n", + "Urdu False\n", + "Dzongkha False\n", + "Czech False\n", + "Tamil False\n", + "Bosnian False\n", + "Telugu False\n", + "Hungarian False\n", + "Filipino False\n", + "Mongolian False\n", + "Slovenian False\n", + "Greek False\n", + "Vietnamese False\n", + "Maya False\n", + "Swahili False\n", + "Name: language, dtype: bool" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['language'].value_counts()==3" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Chinese 3\n", + "Thai 3\n", + "Name: language, dtype: int64" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['language'].value_counts()[df['language'].value_counts()==3]" + ] + }, + { + "cell_type": "code", + 
"execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 False\n", + "4 False\n", + " ... \n", + "5038 False\n", + "5039 False\n", + "5040 False\n", + "5041 False\n", + "5042 False\n", + "Name: language, Length: 5043, dtype: bool" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==1].index)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2388 Chinese\n", + "2740 Thai\n", + "3022 Chinese\n", + "3311 Thai\n", + "3427 Chinese\n", + "3659 Thai\n", + "Name: language, dtype: object" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)]['language']" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "English 4704\n", + "French 73\n", + "Spanish 40\n", + "Hindi 28\n", + "Mandarin 26\n", + "German 19\n", + "Japanese 18\n", + "Russian 11\n", + "Cantonese 11\n", + "Italian 11\n", + "Name: language, dtype: int64" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['language'].value_counts()[df['language'].value_counts()> 10]" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 English\n", + "1 English\n", + "2 English\n", + "3 English\n", + "5 English\n", + " ... 
\n", + "5038 English\n", + "5039 English\n", + "5040 English\n", + "5041 English\n", + "5042 English\n", + "Name: language, Length: 4941, dtype: object" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() > 10].index)].language" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['Chinese', 'Thai'], dtype=object)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts() == 3].index)].language.unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Use group by and lambda to simulate filter on value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
2388ColorDanny Pang23.0107.015.027.0Angelica Lee82.0NaNAction...4.0ChineseChinaPG-13NaN2013.039.05.71.85124
2740ColorTony Jaa110.0110.00.07.0Petchtai Wongkamlao64.0102055.0Action...72.0ThaiThailandR300000000.02008.045.06.22.350
3022ColorMabel Cheung6.0130.03.02.0Ching Wan Lau215.0NaNDrama...6.0ChineseChinaNaN12000000.02015.027.06.22.354
3311ColorChatrichalerm Yukol31.0300.06.06.0Chatchai Plengpanich7.0454255.0Action|Adventure|Drama|History|War...47.0ThaiThailandR400000000.02001.06.06.61.85124
3427ColorDennie Gordon11.0114.029.011.0Ruby Lin163.050000.0Action|Adventure|Comedy|Romance...2.0ChineseChinaNaNNaN2013.020.05.12.3581
3659ColorPrachya Pinkaew112.0111.064.0380.0Nathan Jones778.011905519.0Action|Crime|Drama|Thriller...214.0ThaiThailandR200000000.02005.0635.07.11.850
\n", + "

6 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "2388 Color Danny Pang 23.0 107.0 \n", + "2740 Color Tony Jaa 110.0 110.0 \n", + "3022 Color Mabel Cheung 6.0 130.0 \n", + "3311 Color Chatrichalerm Yukol 31.0 300.0 \n", + "3427 Color Dennie Gordon 11.0 114.0 \n", + "3659 Color Prachya Pinkaew 112.0 111.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "2388 15.0 27.0 Angelica Lee \n", + "2740 0.0 7.0 Petchtai Wongkamlao \n", + "3022 3.0 2.0 Ching Wan Lau \n", + "3311 6.0 6.0 Chatchai Plengpanich \n", + "3427 29.0 11.0 Ruby Lin \n", + "3659 64.0 380.0 Nathan Jones \n", + "\n", + " actor_1_facebook_likes gross genres \\\n", + "2388 82.0 NaN Action \n", + "2740 64.0 102055.0 Action \n", + "3022 215.0 NaN Drama \n", + "3311 7.0 454255.0 Action|Adventure|Drama|History|War \n", + "3427 163.0 50000.0 Action|Adventure|Comedy|Romance \n", + "3659 778.0 11905519.0 Action|Crime|Drama|Thriller \n", + "\n", + " ... num_user_for_reviews language country content_rating \\\n", + "2388 ... 4.0 Chinese China PG-13 \n", + "2740 ... 72.0 Thai Thailand R \n", + "3022 ... 6.0 Chinese China NaN \n", + "3311 ... 47.0 Thai Thailand R \n", + "3427 ... 2.0 Chinese China NaN \n", + "3659 ... 
214.0 Thai Thailand R \n", + "\n", + " budget title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "2388 NaN 2013.0 39.0 5.7 1.85 \n", + "2740 300000000.0 2008.0 45.0 6.2 2.35 \n", + "3022 12000000.0 2015.0 27.0 6.2 2.35 \n", + "3311 400000000.0 2001.0 6.0 6.6 1.85 \n", + "3427 NaN 2013.0 20.0 5.1 2.35 \n", + "3659 200000000.0 2005.0 635.0 7.1 1.85 \n", + "\n", + " movie_facebook_likes \n", + "2388 124 \n", + "2740 0 \n", + "3022 4 \n", + "3311 124 \n", + "3427 81 \n", + "3659 0 \n", + "\n", + "[6 rows x 28 columns]" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby('language').filter(lambda x: len(x) == 3)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2388 Chinese\n", + "2740 Thai\n", + "3022 Chinese\n", + "3311 Thai\n", + "3427 Chinese\n", + "3659 Thai\n", + "Name: language, dtype: object" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby('language').filter(lambda x: len(x) == 3)['language']" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['Chinese', 'Thai'], dtype=object)" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.groupby('language').filter(lambda x: len(x) == 3)['language'].unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Bonus: Which is faster?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100 loops, best of 3: 10.9 ms per loop\n" + ] + } + ], + "source": [ + "%timeit df.groupby('language').filter(lambda x: len(x) == 3)['language']" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "100 loops, best of 3: 3.19 ms per loop\n" + ] + } + ], + "source": [ + "%timeit df[df['language'].isin(df['language'].value_counts()[df['language'].value_counts()==3].index)]['language']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/23.pandas-typeerror-unhashable-type-list-dict.ipynb b/notebooks/pandas/23.pandas-typeerror-unhashable-type-list-dict.ipynb new file mode 100644 index 0000000..281a81d --- /dev/null +++ b/notebooks/pandas/23.pandas-typeerror-unhashable-type-list-dict.ipynb @@ -0,0 +1,1056 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 23. 
Pandas TypeError: unhashable type: 'list'/'dict'\n", + "\n", + "Topics\n", + "\n", + "* apply value_counts for list/dict column\n", + "* value_counts for list column\n", + "* identify list/dict columns\n", + "* `TypeError: unhashable type: 'dict'`\n", + "* `TypeError: unhashable type: 'list'`\n", + "* Correct way to expand list column" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({'col1': [1, 2], 'col2': [[0.5, 0.1], [0.75, 0.25]],'col3': [{0:'a', 1:'b'}, {0:'c', 1:'d'}]})" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col2col3
01[0.5, 0.1]{0: 'a', 1: 'b'}
12[0.75, 0.25]{0: 'c', 1: 'd'}
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "0 1 [0.5, 0.1] {0: 'a', 1: 'b'}\n", + "1 2 [0.75, 0.25] {0: 'c', 1: 'd'}" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. TypeError: unhashable type: 'list'/'dict'" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "unhashable type: 'list'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# TypeError: unhashable type: 'list'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcol2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/base.py\u001b[0m in \u001b[0;36mvalue_counts\u001b[0;34m(self, normalize, sort, ascending, bins, dropna)\u001b[0m\n\u001b[1;32m 1390\u001b[0m \u001b[0mnormalize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnormalize\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1391\u001b[0m \u001b[0mbins\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbins\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1392\u001b[0;31m 
\u001b[0mdropna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdropna\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1393\u001b[0m )\n\u001b[1;32m 1394\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36mvalue_counts\u001b[0;34m(values, sort, ascending, normalize, bins, dropna)\u001b[0m\n\u001b[1;32m 755\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 756\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 757\u001b[0;31m \u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_value_counts_arraylike\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 758\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 759\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mIndex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36m_value_counts_arraylike\u001b[0;34m(values, dropna)\u001b[0m\n\u001b[1;32m 800\u001b[0m \u001b[0;31m# TODO: handle uint8\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 801\u001b[0m \u001b[0mf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhtable\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m\"value_count_{dtype}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mndtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 802\u001b[0;31m \u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 803\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 804\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_func_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.value_count_object\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_func_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.value_count_object\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'list'" + ] + } + ], + "source": [ + "# TypeError: unhashable type: 'list'\n", + "df.col2.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "unhashable type: 'dict'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# TypeError: unhashable type: 'dict'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m 
\u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcol3\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/base.py\u001b[0m in \u001b[0;36mvalue_counts\u001b[0;34m(self, normalize, sort, ascending, bins, dropna)\u001b[0m\n\u001b[1;32m 1390\u001b[0m \u001b[0mnormalize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnormalize\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1391\u001b[0m \u001b[0mbins\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbins\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1392\u001b[0;31m \u001b[0mdropna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdropna\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1393\u001b[0m )\n\u001b[1;32m 1394\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36mvalue_counts\u001b[0;34m(values, sort, ascending, normalize, bins, dropna)\u001b[0m\n\u001b[1;32m 755\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 756\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 757\u001b[0;31m \u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_value_counts_arraylike\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 758\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 759\u001b[0m 
\u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mIndex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36m_value_counts_arraylike\u001b[0;34m(values, dropna)\u001b[0m\n\u001b[1;32m 800\u001b[0m \u001b[0;31m# TODO: handle uint8\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 801\u001b[0m \u001b[0mf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhtable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"value_count_{dtype}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mndtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 802\u001b[0;31m \u001b[0mkeys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcounts\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdropna\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 803\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 804\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_func_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.value_count_object\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_func_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.value_count_object\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: unhashable type: 
'dict'" + ] + } + ], + "source": [ + "# TypeError: unhashable type: 'dict'\n", + "df.col3.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.groupby('col3').transform({'col1': [min], 'col2': max})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. How to detect if column contains list or dict" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "col1 int64 \n", + "col2 object\n", + "col3 object\n", + "dtype: object" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "col1 False\n", + "col2 True \n", + "col3 False\n", + "dtype: bool" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# detect list columns\n", + "df.applymap(lambda x: isinstance(x, list)).all()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "col1 False\n", + "col2 False\n", + "col3 True \n", + "dtype: bool" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# detect dict columns\n", + "df.applymap(lambda x: isinstance(x, dict)).all()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "col1 False\n", + "col2 True \n", + "col3 True \n", + "dtype: bool" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# detect dict or list columns\n", + "df.applymap(lambda x: isinstance(x, dict) or isinstance(x, list)).all()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"## 3.1 Convert the column to string and apply value_counts" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.75, 0.25] 1\n", + "[0.5, 0.1] 1\n", + "Name: col2, dtype: int64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['col2'].astype('str').value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 'c', 1: 'd'} 1\n", + "{0: 'a', 1: 'b'} 1\n", + "Name: col3, dtype: int64" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['col3'].astype('str').value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3.2 Convert the column to string and use group by" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "unhashable type: 'dict'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# TypeError: unhashable type: 'dict'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m 
\u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcol3\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnotna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroupby\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'col3'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/generic.py\u001b[0m in \u001b[0;36mcount\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1594\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1595\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_data_to_aggregate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1596\u001b[0;31m \u001b[0mids\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mngroups\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgrouper\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup_info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1597\u001b[0m \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mids\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1598\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/properties.pyx\u001b[0m in \u001b[0;36mpandas._libs.properties.CachedProperty.__get__\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/ops.py\u001b[0m in 
\u001b[0;36mgroup_info\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 294\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mcache_readonly\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 295\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mgroup_info\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 296\u001b[0;31m \u001b[0mcomp_ids\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mobs_group_ids\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_compressed_labels\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 297\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 298\u001b[0m \u001b[0mngroups\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobs_group_ids\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/ops.py\u001b[0m in \u001b[0;36m_get_compressed_labels\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_get_compressed_labels\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 312\u001b[0;31m \u001b[0mall_labels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mping\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlabels\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mping\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroupings\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 313\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mall_labels\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 314\u001b[0m \u001b[0mgroup_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_group_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mall_labels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxnull\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/ops.py\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 310\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 311\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_get_compressed_labels\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 312\u001b[0;31m \u001b[0mall_labels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mping\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlabels\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mping\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroupings\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 313\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mall_labels\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 314\u001b[0m \u001b[0mgroup_index\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mget_group_index\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mall_labels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxnull\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py\u001b[0m in \u001b[0;36mlabels\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 395\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mlabels\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 396\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_labels\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 397\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_labels\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 398\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_labels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py\u001b[0m in \u001b[0;36m_make_labels\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 419\u001b[0m \u001b[0muniques\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgrouper\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresult_index\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 420\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 421\u001b[0;31m \u001b[0mlabels\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muniques\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0malgorithms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfactorize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgrouper\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msort\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 422\u001b[0m \u001b[0muniques\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mIndex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0muniques\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 423\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_labels\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlabels\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/util/_decorators.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 206\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 207\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mnew_arg_name\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnew_arg_value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 208\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 209\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 210\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mwrapper\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36mfactorize\u001b[0;34m(values, sort, order, na_sentinel, size_hint)\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 671\u001b[0m labels, uniques = _factorize_array(\n\u001b[0;32m--> 672\u001b[0;31m \u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mna_sentinel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_sentinel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msize_hint\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msize_hint\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mna_value\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 673\u001b[0m )\n\u001b[1;32m 674\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/algorithms.py\u001b[0m in \u001b[0;36m_factorize_array\u001b[0;34m(values, na_sentinel, size_hint, na_value)\u001b[0m\n\u001b[1;32m 506\u001b[0m \u001b[0mtable\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mhash_klass\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msize_hint\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 507\u001b[0m uniques, labels = table.factorize(\n\u001b[0;32m--> 508\u001b[0;31m \u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mna_sentinel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_sentinel\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mna_value\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mna_value\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 509\u001b[0m )\n\u001b[1;32m 510\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.factorize\u001b[0;34m()\u001b[0m\n", + "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable._unique\u001b[0;34m()\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: unhashable type: 'dict'" + ] + } + ], + "source": [ + "# TypeError: unhashable type: 'dict'\n", + "df[df.col3.notna()].groupby(['col3']).count()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col3
col2
[0.5, 0.1]11
[0.75, 0.25]11
\n", + "
" + ], + "text/plain": [ + " col1 col3\n", + "col2 \n", + "[0.5, 0.1] 1 1 \n", + "[0.75, 0.25] 1 1 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.col2.notna()].astype('str').groupby(['col2']).count()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col2
col3
{0: 'a', 1: 'b'}11
{0: 'c', 1: 'd'}11
\n", + "
" + ], + "text/plain": [ + " col1 col2\n", + "col3 \n", + "{0: 'a', 1: 'b'} 1 1 \n", + "{0: 'c', 1: 'd'} 1 1 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.col3.notna()].astype('str').groupby(['col3']).count()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Convert list/dict column to tuple" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(0.5, 0.1) 1\n", + "(0.75, 0.25) 1\n", + "Name: col2, dtype: int64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# for list\n", + "df['col2'].apply(tuple).value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(0, 1) 2\n", + "Name: col3, dtype: int64" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# for dict\n", + "df['col3'].apply(tuple).value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Expand the list column" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.75 1\n", + "0.50 1\n", + "Name: 0, dtype: int64" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.col2.apply(pd.Series)[0].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.10 1\n", + "0.25 1\n", + "Name: 1, dtype: int64" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.col2.apply(pd.Series)[1].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. List column mixed: strings and list items" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({'col1': [1, 2], 'col2': [[0.5], 3],'col3': [{0:'a', 1:'b'}, {0:'c', 1:'d'}]})" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col2col3
01[0.5]{0: 'a', 1: 'b'}
123{0: 'c', 1: 'd'}
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "0 1 [0.5] {0: 'a', 1: 'b'}\n", + "1 2 3 {0: 'c', 1: 'd'}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3.0 1\n", + "0.5 1\n", + "Name: col2, dtype: int64" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.applymap(lambda x: x[0] if isinstance(x, list) else x)['col2'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus Step #1: Correct way to expand list column" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({'col1': [1, 2], 'col2': [[0.5, 0.1], [0.75, 0.25]],'col3': [{0:'a', 1:'b'}, {0:'c', 1:'d'}]})" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col2col3
01[0.5, 0.1]{0: 'a', 1: 'b'}
12[0.75, 0.25]{0: 'c', 1: 'd'}
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "0 1 [0.5, 0.1] {0: 'a', 1: 'b'}\n", + "1 2 [0.75, 0.25] {0: 'c', 1: 'd'}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 NaN\n", + "1 NaN\n", + "Name: col2, dtype: float64" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.col2.str.split(',', expand=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
0[0.50.1]
1[0.750.25]
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 [0.5 0.1] \n", + "1 [0.75 0.25]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.col2.astype('str').str.split(',', expand=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
00.500.10
10.750.25
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 0.50 0.10\n", + "1 0.75 0.25" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.col2.apply(pd.Series)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "df[['l1', 'l2']] = df.col2.apply(pd.Series)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
col1col2col3
l1l2
0.500.101[0.5, 0.1]{0: 'a', 1: 'b'}
0.750.252[0.75, 0.25]{0: 'c', 1: 'd'}
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "l1 l2 \n", + "0.50 0.10 1 [0.5, 0.1] {0: 'a', 1: 'b'}\n", + "0.75 0.25 2 [0.75, 0.25] {0: 'c', 1: 'd'}" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.set_index(['l1', 'l2'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/24-pandas-check-value-column-contained-another-column-same-row.ipynb b/notebooks/pandas/24-pandas-check-value-column-contained-another-column-same-row.ipynb new file mode 100644 index 0000000..d0c667c --- /dev/null +++ b/notebooks/pandas/24-pandas-check-value-column-contained-another-column-same-row.ipynb @@ -0,0 +1,896 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 24. Pandas: Check If Value of Column Is Contained in Another Column in the Same Row" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titleplot_keywordscountry
0Avataravatar|future|marine|native|paraplegicUSA
1Pirates of the Caribbean: At World's Endgoddess|marriage ceremony|marriage proposal|pi...USA
2Spectrebomb|espionage|sequel|spy|terroristUK
3The Dark Knight Risesdeception|imprisonment|lawlessness|police offi...USA
4Star Wars: Episode VII - The Force Awakens  ...NaNNaN
5John Carteralien|american civil war|male nipple|mars|prin...USA
6Spider-Man 3sandman|spider man|symbiote|venom|villainUSA
7Tangled17th century|based on fairy tale|disney|flower...USA
8Avengers: Age of Ultronartificial intelligence|based on comic book|ca...USA
9Harry Potter and the Half-Blood Princeblood|book|love|potion|professorUK
\n", + "
" + ], + "text/plain": [ + " movie_title \\\n", + "0 Avatar  \n", + "1 Pirates of the Caribbean: At World's End  \n", + "2 Spectre  \n", + "3 The Dark Knight Rises  \n", + "4 Star Wars: Episode VII - The Force Awakens  ... \n", + "5 John Carter  \n", + "6 Spider-Man 3  \n", + "7 Tangled  \n", + "8 Avengers: Age of Ultron  \n", + "9 Harry Potter and the Half-Blood Prince  \n", + "\n", + " plot_keywords country \n", + "0 avatar|future|marine|native|paraplegic USA \n", + "1 goddess|marriage ceremony|marriage proposal|pi... USA \n", + "2 bomb|espionage|sequel|spy|terrorist UK \n", + "3 deception|imprisonment|lawlessness|police offi... USA \n", + "4 NaN NaN \n", + "5 alien|american civil war|male nipple|mars|prin... USA \n", + "6 sandman|spider man|symbiote|venom|villain USA \n", + "7 17th century|based on fairy tale|disney|flower... USA \n", + "8 artificial intelligence|based on comic book|ca... USA \n", + "9 blood|book|love|potion|professor UK " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[['movie_title', 'plot_keywords', 'country']].head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Check If String Column Contains Substring of Another with Function" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titlecountry
196AustraliaAustralia
2504McFarland, USAUSA
\n", + "
" + ], + "text/plain": [ + " movie_title country\n", + "196 Australia  Australia\n", + "2504 McFarland, USA  USA" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def find_value_column(row):\n", + " return row.country in row.movie_title\n", + "\n", + "df.country.fillna('_', inplace=True)\n", + "df[df.apply(find_value_column, axis=1)][['movie_title', 'country']].head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "for row in df.loc[df.plot_keywords.isnull(), 'plot_keywords'].index:\n", + " df.at[row, 'plot_keywords'] = []" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titleplot_keywords
0Avataravatar|future|marine|native|paraplegic
22Robin Hood1190s|archer|england|king of england|robin hood
25King Konganimal name in title|ape abducts a woman|goril...
26Titanicartist|love|ship|titanic|wet
33Alice in Wonderlandalice in wonderland|mistaking reality for drea...
130Thorbattle|marvel cinematic universe|scientist|tho...
145Pan1940s|child hero|fantasy world|orphan|referenc...
147Troygreek|mythology|prince|trojan|troy
150Ghostbustersghost|ghostbuster|ghostbusters|male objectific...
160Star Trekbox office hit|future|lifted by the throat|sta...
\n", + "
" + ], + "text/plain": [ + " movie_title plot_keywords\n", + "0 Avatar  avatar|future|marine|native|paraplegic\n", + "22 Robin Hood  1190s|archer|england|king of england|robin hood\n", + "25 King Kong  animal name in title|ape abducts a woman|goril...\n", + "26 Titanic  artist|love|ship|titanic|wet\n", + "33 Alice in Wonderland  alice in wonderland|mistaking reality for drea...\n", + "130 Thor  battle|marvel cinematic universe|scientist|tho...\n", + "145 Pan  1940s|child hero|fantasy world|orphan|referenc...\n", + "147 Troy  greek|mythology|prince|trojan|troy\n", + "150 Ghostbusters  ghost|ghostbuster|ghostbusters|male objectific...\n", + "160 Star Trek  box office hit|future|lifted by the throat|sta..." + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def find_value_column(row):\n", + " return row.movie_title.lower().strip() in row.plot_keywords\n", + "\n", + "df[df.apply(find_value_column, axis=1)][['movie_title', 'plot_keywords']].head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Check If Column contains another column with lambda" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titlecountry
196AustraliaAustralia
2504McFarland, USAUSA
\n", + "
" + ], + "text/plain": [ + " movie_title country\n", + "196 Australia  Australia\n", + "2504 McFarland, USA  USA" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.apply(lambda x: x.country in x.movie_title, axis=1)][['movie_title', 'country']].head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "ename": "TypeError", + "evalue": "(\"'Series' objects are mutable, thus they cannot be hashed\", 'occurred at index 0')", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Warning for common error\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mrow\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcountry\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmovie_title\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36mapply\u001b[0;34m(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)\u001b[0m\n\u001b[1;32m 6904\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6905\u001b[0m )\n\u001b[0;32m-> 6906\u001b[0;31m 
\u001b[0;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6907\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6908\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mapplymap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/apply.py\u001b[0m in \u001b[0;36mget_result\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 184\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_raw\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 185\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 186\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_standard\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 187\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 188\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mapply_empty_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/apply.py\u001b[0m in \u001b[0;36mapply_standard\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 290\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 291\u001b[0m \u001b[0;31m# compute the result using the series generator\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 292\u001b[0;31m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply_series_generator\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 293\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 294\u001b[0m \u001b[0;31m# wrap results\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/apply.py\u001b[0m in \u001b[0;36mapply_series_generator\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 319\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 320\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mv\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mseries_gen\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 321\u001b[0;31m \u001b[0mresults\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 322\u001b[0m \u001b[0mkeys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 323\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mException\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m(row)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Warning for common 
error\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;32mlambda\u001b[0m \u001b[0mrow\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcountry\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmovie_title\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m__contains__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1935\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__contains__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1936\u001b[0m \u001b[0;34m\"\"\"True if the key is in the info axis\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1937\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_info_axis\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1938\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1939\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/indexes/range.py\u001b[0m in \u001b[0;36m__contains__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 362\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 363\u001b[0m \u001b[0;32mdef\u001b[0m 
\u001b[0m__contains__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mUnion\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mint\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minteger\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0mbool\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 364\u001b[0;31m \u001b[0mhash\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 365\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 366\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mensure_python_int\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m__hash__\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 1885\u001b[0m raise TypeError(\n\u001b[1;32m 1886\u001b[0m \u001b[0;34m\"{0!r} objects are mutable, thus they cannot be\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1887\u001b[0;31m \u001b[0;34m\" hashed\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__class__\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1888\u001b[0m )\n\u001b[1;32m 1889\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mTypeError\u001b[0m: (\"'Series' objects are mutable, thus they cannot be hashed\", 'occurred at index 0')" + ] + } + ], + "source": [ + "# Warning for common error\n", + "df.apply(lambda 
row: df.country in df.movie_title, axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Fastest Way to Check If One Column Contains Another" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "df['country'].fillna('Unknown', inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
movie_titlecountry
196AustraliaAustralia
2504McFarland, USAUSA
\n", + "
" + ], + "text/plain": [ + " movie_title country\n", + "196 Australia  Australia\n", + "2504 McFarland, USA  USA" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[x[0] in x[1] for x in zip(df['country'], df['movie_title'])]][['movie_title', 'country']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: For Loop and df.iterrows() Version" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Australia Australia \n", + "USA McFarland, USA \n" + ] + } + ], + "source": [ + "for i, row in df.iterrows():\n", + " if row.country in row.movie_title:\n", + " print(row.country, row.movie_title)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bonus Step: Check If List Column Contains Substring of Another with Function" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "df['keywords'] = df.plot_keywords.str.split('|')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 [avatar, future, marine, native, paraplegic]\n", + "1 [goddess, marriage ceremony, marriage proposal...\n", + "2 [bomb, espionage, sequel, spy, terrorist]\n", + "3 [deception, imprisonment, lawlessness, police ...\n", + "4 NaN\n", + " ... 
\n", + "5038 [fraud, postal worker, prison, theft, trial]\n", + "5039 [cult, fbi, hideout, prison escape, serial kil...\n", + "5040 NaN\n", + "5041 NaN\n", + "5042 [actress name in title, crush, date, four word...\n", + "Name: keywords, Length: 5043, dtype: object" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['keywords']" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " movie_title \\\n", + "0 Avatar  \n", + "9 Harry Potter and the Half-Blood Prince  \n", + "33 Alice in Wonderland  \n", + "68 Monsters vs. Aliens  \n", + "77 G.I. Joe: The Rise of Cobra  \n", + "\n", + " keywords \n", + "0 [avatar, future, marine, native, paraplegic] \n", + "9 [blood, book, love, potion, professor] \n", + "33 [alice in wonderland, mistaking reality for dr... \n", + "68 [alien, alien invasion, alien space craft, gia... \n", + "77 [cobra, gi joe, snake, train, warhead] " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def find_value_column(row):\n", + " if isinstance(row['keywords'], list):\n", + " for keyword in row['keywords']:\n", + " return keyword in row.movie_title.lower()\n", + " else:\n", + " return False\n", + "\n", + "df[df.apply(find_value_column, axis=1)][['movie_title', 'keywords']].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "df['keywords'] = df['keywords'].apply(lambda d: d if isinstance(d, list) else [])" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " movie_title \\\n", + "0 Avatar  \n", + "9 Harry Potter and the Half-Blood Prince  \n", + "33 Alice in Wonderland  \n", + "68 Monsters vs. Aliens  \n", + "77 G.I. Joe: The Rise of Cobra  \n", + "\n", + " keywords \n", + "0 [avatar, future, marine, native, paraplegic] \n", + "9 [blood, book, love, potion, professor] \n", + "33 [alice in wonderland, mistaking reality for dr... \n", + "68 [alien, alien invasion, alien space craft, gia... \n", + "77 [cobra, gi joe, snake, train, warhead] " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def find_value_column(row):\n", + " for keyword in row['keywords']:\n", + " return keyword in row.movie_title.lower()\n", + " return False\n", + "\n", + "df[df.apply(find_value_column, axis=1)][['movie_title', 'keywords']].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Performance" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10 loops, best of 3: 154 ms per loop\n" + ] + } + ], + "source": [ + "%%timeit\n", + "def find_value_column(row):\n", + " return row.country in row.movie_title\n", + "\n", + "df[df.apply(find_value_column, axis=1)]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10 loops, best of 3: 155 ms per loop\n" + ] + } + ], + "source": [ + "%%timeit\n", + "df[df.apply(lambda x: x.country in x.movie_title, axis=1)]" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1000 loops, best of 3: 1.76 ms per loop\n" + ] + } + ], + "source": [ + "%%timeit\n", + "df[[x[0] in x[1] for x in zip(df['country'], df['movie_title'])]]" + ] + }, + { + "cell_type": "code", + 
"execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1 loop, best of 3: 599 ms per loop\n" + ] + } + ], + "source": [ + "%%timeit\n", + "for i, row in df.iterrows():\n", + " if row.country in row.movie_title:\n", + " pass" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/25_Pandas_Create_A_Matplotlib_Scatterplot_From_A_Dataframe.ipynb b/notebooks/pandas/25_Pandas_Create_A_Matplotlib_Scatterplot_From_A_Dataframe.ipynb new file mode 100644 index 0000000..8e593db --- /dev/null +++ b/notebooks/pandas/25_Pandas_Create_A_Matplotlib_Scatterplot_From_A_Dataframe.ipynb @@ -0,0 +1,964 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 25. 
Pandas: Create A Matplotlib Scatterplot From A Dataframe " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Datasets:\n", + "* https://www.kaggle.com/statchaitya/country-to-continent\n", + "* https://www.kaggle.com/erikbruin/countries-of-the-world-iso-codes-and-population\n", + "* https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset\n", + "\n", + "\"Drawing\"\n", + "\"Drawing\"" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import chardet\n", + "import pandas as pd\n", + "\n", + "df = pd.read_csv(\"../csv/covid/covid_19_clean_complete.csv\")\n", + "population = pd.read_csv(\"../csv/covid/countries_by_population_2019.csv\")\n", + "\n", + "with open('../csv/covid/countryContinent.csv', 'rb') as f:\n", + " result = chardet.detect(f.read()) # or readline if the file is large\n", + "\n", + "continent = pd.read_csv(\"../csv/covid/countryContinent.csv\" , encoding=result['encoding'])" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Province/State Country/Region Lat Long Date Confirmed \\\n", + "15061 NaN Barbados 13.1939 -59.5432 3/17/20 2 \n", + "15062 NaN Montenegro 42.5000 19.3000 3/17/20 2 \n", + "15063 NaN The Gambia 13.4667 -16.6000 3/17/20 1 \n", + "\n", + " Deaths Recovered \n", + "15061 0 0 \n", + "15062 0 0 \n", + "15063 0 0 " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.tail(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Rank name pop2019 pop2018 GrowthRate area Density\n", + "0 1 China 1433783.686 NaN 1.0039 9706961.0 147.7068\n", + "1 2 India 1366417.754 NaN 1.0099 3287590.0 415.6290\n", + "2 3 United States 329064.917 NaN 1.0059 9372610.0 35.1092" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "population.head(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " country code_2 code_3 country_code iso_3166_2 continent \\\n", + "0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia \n", + "1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe \n", + "2 Albania AL ALB 8 ISO 3166-2:AL Europe \n", + "\n", + " sub_region region_code sub_region_code \n", + "0 Southern Asia 142.0 34.0 \n", + "1 Northern Europe 150.0 154.0 \n", + "2 Southern Europe 150.0 39.0 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "continent.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #1: Combine covid and continent data" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df = df.merge(continent, left_on='Country/Region', right_on='country', how='inner')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Province/State Country/Region Lat Long Date Confirmed Deaths \\\n", + "0 NaN Thailand 15.0 101.0 1/22/20 2 0 \n", + "1 NaN Thailand 15.0 101.0 1/23/20 3 0 \n", + "2 NaN Thailand 15.0 101.0 1/24/20 5 0 \n", + "\n", + " Recovered country code_2 code_3 country_code iso_3166_2 continent \\\n", + "0 0 Thailand TH THA 764 ISO 3166-2:TH Asia \n", + "1 0 Thailand TH THA 764 ISO 3166-2:TH Asia \n", + "2 0 Thailand TH THA 764 ISO 3166-2:TH Asia \n", + "\n", + " sub_region region_code sub_region_code \n", + "0 South-Eastern Asia 142.0 35.0 \n", + "1 South-Eastern Asia 142.0 35.0 \n", + "2 South-Eastern Asia 142.0 35.0 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #2: Get last value for Confirmed per country" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Confirmed Country/Region Recovered\n", + "0 22 Afghanistan 1\n", + "1 55 Albania 0\n", + "2 60 Algeria 12\n", + "3 39 Andorra 1\n", + "4 1 Antigua and Barbuda 0" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "last_confirmed_number = df[df.Confirmed > 0].groupby('Country/Region', as_index = False).last()[['Confirmed', 'Country/Region', 'Recovered']]\n", + "last_confirmed_number.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #3: Get first date of Confirmed per country" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Date Country/Region continent\n", + "0 2/24/20 Afghanistan Asia\n", + "1 3/9/20 Albania Europe\n", + "2 2/25/20 Algeria Africa\n", + "3 3/2/20 Andorra Europe\n", + "4 3/13/20 Antigua and Barbuda Americas" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "first_date = df[df.Confirmed > 0].groupby('Country/Region', as_index = False).first()[['Date', 'Country/Region', 'continent']]\n", + "first_date.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #4: Combine last values and first date" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Confirmed Country/Region Recovered Date continent\n", + "92 236 Pakistan 2 2/26/20 Asia\n", + "97 238 Poland 13 3/4/20 Europe\n", + "109 266 Singapore 114 1/23/20 Asia\n", + "111 275 Slovenia 0 3/5/20 Europe\n", + "40 321 Finland 10 1/29/20 Europe" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df3 = last_confirmed_number.merge(first_date, on='Country/Region', how='inner')\n", + "df_final = df3.sort_values(by=['Confirmed', 'Date']).tail(20)\n", + "df_final.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #5: Convert dates to datetime and sort" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "df_final['Date'] = pd.to_datetime(df_final['Date'])" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "63 2020-01-22\n", + "109 2020-01-23\n", + "75 2020-01-25\n", + "44 2020-01-27\n", + "40 2020-01-29\n", + "61 2020-01-31\n", + "118 2020-01-31\n", + "114 2020-02-01\n", + "15 2020-02-04\n", + "60 2020-02-21\n", + "119 2020-02-25\n", + "9 2020-02-25\n", + "90 2020-02-26\n", + "92 2020-02-26\n", + "46 2020-02-26\n", + "19 2020-02-26\n", + "99 2020-02-29\n", + "98 2020-03-02\n", + "97 2020-03-04\n", + "111 2020-03-05\n", + "Name: Date, dtype: datetime64[ns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_final['Date'].sort_values()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #6: Plot Data as Scatterplot\n", + "* x axis - Current Active Cases\n", + "* y axis - First Date Confirmed\n", + "* size of points - Current Recovered " + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "from matplotlib.pyplot import figure\n", + "\n", + "figure(num=None, figsize=(12, 8), dpi=100, facecolor='w', edgecolor='k')\n", + "\n", + "plt.scatter(df_final.Confirmed, df_final.Date, s=df_final.Recovered, alpha = 0.25)\n", + "\n", + "[plt.text( x=row['Confirmed'], y=row['Date'], s=row['Country/Region']) for k,row in df_final.iterrows()]\n", + "\n", + "plt.xlabel('Current Active Cases')\n", + "plt.ylabel('First Date Confirmed')\n", + "\n", + "axes = plt.gca()\n", + "axes.set_ylim(['2020-01-20','2020-03-10'])\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #7: Plot Data with continent colors" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "continent_colors = {'Europe':'red',\n", + " 'Africa':'green',\n", + " 'Americas':'blue',\n", + " 'Asia':'cyan',\n", + " 'Australia and New Zealand':'purple'}" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "from matplotlib.pyplot import figure\n", + "\n", + "figure(num=None, figsize=(12, 8), dpi=100, facecolor='w', edgecolor='k')\n", + "\n", + "plt.xlabel('Current Active Cases')\n", + "plt.ylabel('First Date Confirmed')\n", + "\n", + "for i,j in df_final.iterrows():\n", + " reg_color = continent_colors.get(j['continent'], 'black')\n", + " plt.scatter(df_final['Confirmed'][i], df_final['Date'][i], s=200, alpha = 0.25, color=reg_color)\n", + "\n", + " \n", + "[plt.text( x=row['Confirmed'], y=row['Date'], s=row['Country/Region']) for k,row in df_final.iterrows()] \n", + "axes = plt.gca()\n", + "axes.set_ylim(['2020-01-20','2020-03-10'])\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/26.pandas-display-all-columns-and-show-more-rows.ipynb b/notebooks/pandas/26.pandas-display-all-columns-and-show-more-rows.ipynb new file mode 100644 index 0000000..b495c54 --- /dev/null +++ b/notebooks/pandas/26.pandas-display-all-columns-and-show-more-rows.ipynb @@ -0,0 +1,2177 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 26. 
Pandas Display All Columns and Show More Rows" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(5043, 28)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "2 Color Sam Mendes 602.0 148.0 \n", + "3 Color Christopher Nolan 813.0 164.0 \n", + "4 NaN Doug Walker NaN NaN \n", + "... ... ... ... ... \n", + "5038 Color Scott Smith 1.0 87.0 \n", + "5039 Color NaN 43.0 43.0 \n", + "5040 Color Benjamin Roberds 13.0 76.0 \n", + "5041 Color Daniel Hsia 14.0 100.0 \n", + "5042 Color Jon Gunn 43.0 90.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "2 0.0 161.0 Rory Kinnear \n", + "3 22000.0 23000.0 Christian Bale \n", + "4 131.0 NaN Rob Walker \n", + "... ... ... ... \n", + "5038 2.0 318.0 Daphne Zuniga \n", + "5039 NaN 319.0 Valorie Curry \n", + "5040 0.0 0.0 Maxwell Moody \n", + "5041 0.0 489.0 Daniel Henney \n", + "5042 16.0 16.0 Brian Herzlinger \n", + "\n", + " actor_1_facebook_likes gross genres \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy \n", + "2 11000.0 200074175.0 Action|Adventure|Thriller \n", + "3 27000.0 448130642.0 Action|Thriller \n", + "4 131.0 NaN Documentary \n", + "... ... ... ... \n", + "5038 637.0 NaN Comedy|Drama \n", + "5039 841.0 NaN Crime|Drama|Mystery|Thriller \n", + "5040 0.0 NaN Drama|Horror|Thriller \n", + "5041 946.0 10443.0 Comedy|Drama|Romance \n", + "5042 86.0 85222.0 Documentary \n", + "\n", + " ... num_user_for_reviews language country content_rating budget \\\n", + "0 ... 3054.0 English USA PG-13 237000000.0 \n", + "1 ... 1238.0 English USA PG-13 300000000.0 \n", + "2 ... 994.0 English UK PG-13 245000000.0 \n", + "3 ... 2701.0 English USA PG-13 250000000.0 \n", + "4 ... NaN NaN NaN NaN NaN \n", + "... ... ... ... ... ... ... \n", + "5038 ... 6.0 English Canada NaN NaN \n", + "5039 ... 359.0 English USA TV-14 NaN \n", + "5040 ... 
3.0 English USA NaN 1400.0 \n", + "5041 ... 9.0 English USA PG-13 NaN \n", + "5042 ... 84.0 English USA PG 1100.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "0 2009.0 936.0 7.9 1.78 \n", + "1 2007.0 5000.0 7.1 2.35 \n", + "2 2015.0 393.0 6.8 2.35 \n", + "3 2012.0 23000.0 8.5 2.35 \n", + "4 NaN 12.0 7.1 NaN \n", + "... ... ... ... ... \n", + "5038 2013.0 470.0 7.7 NaN \n", + "5039 NaN 593.0 7.5 16.00 \n", + "5040 2013.0 0.0 6.3 NaN \n", + "5041 2012.0 719.0 6.3 2.35 \n", + "5042 2004.0 23.0 6.6 1.85 \n", + "\n", + " movie_facebook_likes \n", + "0 33000 \n", + "1 0 \n", + "2 85000 \n", + "3 164000 \n", + "4 0 \n", + "... ... \n", + "5038 84 \n", + "5039 32000 \n", + "5040 16 \n", + "5041 660 \n", + "5042 456 \n", + "\n", + "[5043 rows x 28 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " actor_1_facebook_likes gross \\\n", + "5 640.0 73058679.0 \n", + "6 24000.0 336530303.0 \n", + "7 799.0 200807262.0 \n", + "8 26000.0 458991599.0 \n", + "9 25000.0 301956980.0 \n", + "\n", + " genres actor_1_name \n", + "5 Action|Adventure|Sci-Fi Daryl Sabara \n", + "6 Action|Adventure|Romance J.K. Simmons \n", + "7 Adventure|Animation|Comedy|Family|Fantasy|Musi... Brad Garrett \n", + "8 Action|Adventure|Sci-Fi Chris Hemsworth \n", + "9 Adventure|Family|Fantasy|Mystery Alan Rickman " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.iloc[5:10,7:11]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #1: Display all columns and rows with Pandas options" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('display.max_rows', None)\n", + "pd.set_option('display.max_columns', None)\n", + "pd.set_option('display.width', None)\n", + "pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Passing a negative integer is deprecated in version 1.0 and will not be supported in future version. 
Instead, use None to not limit the column width.\n", + " \"\"\"Entry point for launching an IPython kernel.\n" + ] + } + ], + "source": [ + "pd.set_option('display.max_colwidth', -1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #2: Display more or all rows " + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "pd.reset_option('display.max_rows')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                                                  genres
Drama                                                236
Comedy                                               209
Comedy|Drama                                         191
Comedy|Drama|Romance                                 187
Comedy|Romance                                       158
...                                                  ...
Adventure|Animation|Comedy|Fantasy|Music|Romance       1
Family|Fantasy|Music                                   1
Action|Adventure|Drama|History|Romance|War             1
Biography|Comedy|Crime|Drama|Romance                   1
Adventure|Comedy|Musical|Romance                       1
\n", + "

914 rows × 1 columns

\n", + "
" + ], + "text/plain": [ + " genres\n", + "Drama 236 \n", + "Comedy 209 \n", + "Comedy|Drama 191 \n", + "Comedy|Drama|Romance 187 \n", + "Comedy|Romance 158 \n", + "... ... \n", + "Adventure|Animation|Comedy|Fantasy|Music|Romance 1 \n", + "Family|Fantasy|Music 1 \n", + "Action|Adventure|Drama|History|Romance|War 1 \n", + "Biography|Comedy|Crime|Drama|Romance 1 \n", + "Adventure|Comedy|Musical|Romance 1 \n", + "\n", + "[914 rows x 1 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.genres.value_counts(dropna=False).to_frame()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('display.max_rows', 100)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('display.max_rows', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Drama 236\n", + "Comedy 209\n", + "Comedy|Drama 191\n", + "Comedy|Drama|Romance 187\n", + "Comedy|Romance 158\n", + "Drama|Romance 152\n", + "Crime|Drama|Thriller 101\n", + "Horror 71 \n", + "Action|Crime|Drama|Thriller 68 \n", + "Action|Crime|Thriller 65 \n", + "Drama|Thriller 64 \n", + "Crime|Drama 63 \n", + "Horror|Thriller 56 \n", + "Crime|Drama|Mystery|Thriller 55 \n", + "Action|Adventure|Sci-Fi 51 \n", + "Comedy|Crime 51 \n", + "Documentary 51 \n", + "Action|Adventure|Thriller 46 \n", + "Drama|Mystery|Thriller 37 \n", + "Biography|Drama 35 \n", + "Action|Adventure|Sci-Fi|Thriller 35 \n", + "Horror|Mystery|Thriller 35 \n", + "Action|Comedy|Crime 30 \n", + "Action|Thriller 30 \n", + "Action|Adventure|Fantasy 30 \n", + "Horror|Mystery 29 \n", + "Adventure|Animation|Comedy|Family|Fantasy 29 \n", + "Drama|Music 27 \n", + "Drama|Sport 26 \n", + "Comedy|Family 26 \n", + "Biography|Drama|History 26 \n", + "Biography|Drama|Sport 26 \n", + 
"Adventure|Animation|Comedy|Family 24 \n", + "Comedy|Crime|Drama 24 \n", + "Drama|War 23 \n", + "Action|Comedy|Crime|Thriller 23 \n", + "Action|Sci-Fi 23 \n", + "Action|Drama|Thriller 22 \n", + "Mystery|Thriller 22 \n", + "Action|Crime|Drama|Mystery|Thriller 22 \n", + "Drama|History|War 21 \n", + "Drama|Horror|Mystery|Thriller 21 \n", + "Comedy|Family|Fantasy 20 \n", + "Adventure|Family|Fantasy 20 \n", + "Thriller 20 \n", + "Drama|Music|Romance 19 \n", + "Crime|Thriller 18 \n", + "Horror|Sci-Fi|Thriller 18 \n", + "Comedy|Drama|Music 18 \n", + "Comedy|Horror 18 \n", + "Fantasy|Horror 18 \n", + "Drama|Family 18 \n", + "Biography|Drama|Romance 17 \n", + "Comedy|Music 17 \n", + "Action|Sci-Fi|Thriller 16 \n", + "Crime|Drama|Romance|Thriller 16 \n", + "Comedy|Sport 16 \n", + "Biography|Crime|Drama 16 \n", + "Comedy|Fantasy 16 \n", + "Crime|Mystery|Thriller 15 \n", + "Comedy|Drama|Family 15 \n", + "Action|Comedy 15 \n", + "Comedy|Crime|Thriller 15 \n", + "Action|Horror|Sci-Fi|Thriller 15 \n", + "Drama|Sci-Fi|Thriller 15 \n", + "Drama|Fantasy|Romance 14 \n", + "Comedy|Drama|Romance|Sport 14 \n", + "Drama|Romance|War 14 \n", + "Adventure|Comedy 14 \n", + "Action|Adventure|Fantasy|Sci-Fi 13 \n", + "Crime|Drama|Romance 13 \n", + "Comedy|Fantasy|Romance 13 \n", + "Comedy|Family|Romance 13 \n", + "Drama|Horror|Thriller 13 \n", + "Adventure|Drama 13 \n", + "Action|Adventure|Drama|Thriller 13 \n", + "Drama|Mystery|Sci-Fi|Thriller 13 \n", + "Biography|Drama|Music 13 \n", + "Drama|Mystery|Romance|Thriller 12 \n", + "Action|Adventure|Comedy 12 \n", + "Action|Adventure|Fantasy|Sci-Fi|Thriller 12 \n", + "Adventure|Comedy|Family|Fantasy 12 \n", + "Adventure|Animation|Family|Fantasy 12 \n", + "Western 12 \n", + "Action|Crime|Sci-Fi|Thriller 11 \n", + "Adventure|Comedy|Sci-Fi 11 \n", + "Biography|Drama|History|Romance 11 \n", + "Drama|Horror|Sci-Fi|Thriller 11 \n", + "Comedy|Fantasy|Horror 11 \n", + "Action|Adventure 11 \n", + "Action 11 \n", + "Comedy|Crime|Romance 11 \n", + 
"Comedy|Drama|Fantasy|Romance 11 \n", + "Documentary|Music 10 \n", + "Action|Drama 10 \n", + "Drama|Mystery 10 \n", + "Action|Mystery|Thriller 10 \n", + "Drama|History 10 \n", + "Action|Horror|Thriller 10 \n", + "Drama|Sci-Fi 10 \n", + "Action|Drama|War 10 \n", + "Crime|Drama|Mystery 10 \n", + "Drama|Romance|Sci-Fi 10 \n", + "Drama|Fantasy 10 \n", + "Fantasy|Horror|Thriller 10 \n", + "Sci-Fi|Thriller 10 \n", + "Action|Horror|Sci-Fi 9 \n", + "Animation|Comedy|Family 9 \n", + "Comedy|Music|Romance 9 \n", + "Horror|Sci-Fi 9 \n", + "Drama|Musical|Romance 9 \n", + "Action|Adventure|Crime|Thriller 9 \n", + "Action|Drama|Sci-Fi|Thriller 9 \n", + "Fantasy|Horror|Mystery|Thriller 9 \n", + "Comedy|Sci-Fi 9 \n", + "Adventure|Fantasy 9 \n", + "Comedy|Drama|Music|Romance 9 \n", + "Drama|History|Thriller 9 \n", + "Comedy|Horror|Sci-Fi 8 \n", + "Action|Comedy|Sci-Fi 8 \n", + "Drama|Western 8 \n", + "Adventure|Drama|Romance 8 \n", + "Comedy|Crime|Drama|Thriller 8 \n", + "Action|Drama|History|War 8 \n", + "Adventure|Comedy|Drama 8 \n", + "Action|Adventure|Comedy|Family|Sci-Fi 8 \n", + "Adventure|Comedy|Family 8 \n", + "Drama|Romance|Western 8 \n", + "Comedy|Fantasy|Horror|Thriller 8 \n", + "Action|Adventure|Drama 7 \n", + "Biography|Drama|Thriller 7 \n", + "Action|Adventure|Drama|Romance 7 \n", + "Adventure|Drama|Thriller 7 \n", + "Adventure|Sci-Fi|Thriller 7 \n", + "Mystery|Sci-Fi|Thriller 7 \n", + "Action|Adventure|Drama|History|War 7 \n", + "Comedy|Documentary 7 \n", + "Comedy|Musical|Romance 7 \n", + "Adventure|Animation|Comedy|Family|Sci-Fi 7 \n", + "Action|Drama|Thriller|War 7 \n", + "Action|Crime|Mystery|Thriller 7 \n", + "Comedy|Romance|Sport 7 \n", + "Adventure|Drama|History 7 \n", + "Action|Comedy|Crime|Drama|Thriller 7 \n", + "Action|Adventure|Animation|Comedy|Family 6 \n", + "Action|Adventure|Drama|Fantasy 6 \n", + "Comedy|Drama|Sport 6 \n", + "Biography|Drama|History|War 6 \n", + "Action|Comedy|Romance 6 \n", + "Crime|Horror|Thriller 6 \n", + 
"Action|Crime|Drama|Romance|Thriller 6 \n", + "Animation|Comedy|Family|Fantasy 6 \n", + "Comedy|Horror|Thriller 6 \n", + "Action|Adventure|Fantasy|Thriller 6 \n", + "Drama|Romance|Thriller 6 \n", + "Comedy|Drama|Family|Romance 6 \n", + "Action|Adventure|Western 6 \n", + "Action|Mystery|Sci-Fi|Thriller 6 \n", + "Biography|Crime|Drama|Thriller 6 \n", + "Action|Adventure|Comedy|Crime 6 \n", + "Comedy|Crime|Mystery 6 \n", + "Drama|History|Romance|War 6 \n", + "Drama|Fantasy|Horror|Thriller 6 \n", + "Horror|Mystery|Sci-Fi|Thriller 6 \n", + "Drama|Romance|Sport 6 \n", + "Crime|Drama|Horror|Thriller 5 \n", + "Action|Horror 5 \n", + "Comedy|Drama|Musical|Romance 5 \n", + "Adventure|Drama|Family|Fantasy 5 \n", + "Action|Adventure|Comedy|Thriller 5 \n", + "Comedy|Family|Sci-Fi 5 \n", + "Action|Adventure|Comedy|Sci-Fi 5 \n", + "Documentary|Sport 5 \n", + "Comedy|Romance|Sci-Fi 5 \n", + "Action|Adventure|Drama|Sci-Fi 5 \n", + "Crime|Horror|Mystery|Thriller 5 \n", + "Biography|Comedy|Drama 5 \n", + "Drama|Horror 5 \n", + "Action|Adventure|Family|Fantasy 5 \n", + "Biography|Drama|Music|Musical 5 \n", + "Action|Adventure|Romance 5 \n", + "Adventure|Drama|Romance|War 5 \n", + "Family 5 \n", + "Adventure|Animation|Family 5 \n", + "Comedy|Musical 5 \n", + "Drama|History|Thriller|War 5 \n", + "Action|Fantasy|Horror|Thriller 5 \n", + "Comedy|Drama|Family|Sport 5 \n", + "Comedy|Family|Fantasy|Romance 5 \n", + "Adventure|Family 5 \n", + "Adventure|Horror|Thriller 5 \n", + "Action|Crime|Drama 5 \n", + "Drama|Mystery|Sci-Fi 5 \n", + "Action|Drama|Sport 5 \n", + "Adventure|Family|Fantasy|Mystery 5 \n", + "Comedy|Drama|Fantasy 4 \n", + "Action|Adventure|Romance|Sci-Fi 4 \n", + "Drama|Family|Fantasy|Romance 4 \n", + "Comedy|Western 4 \n", + "Comedy|Drama|War 4 \n", + "Action|Comedy|Horror 4 \n", + "Adventure|Biography|Drama|History|War 4 \n", + "Documentary|War 4 \n", + "Adventure|Mystery|Sci-Fi 4 \n", + "Drama|Mystery|Romance 4 \n", + "Action|Adventure|Animation|Comedy|Family|Fantasy 4 \n", 
+ "Romance 4 \n", + "Drama|Musical 4 \n", + "Comedy|Drama|Family|Music|Musical|Romance 4 \n", + "Drama|Family|Romance 4 \n", + "Action|Adventure|Romance|Sci-Fi|Thriller 4 \n", + "Action|Comedy|Fantasy|Sci-Fi 4 \n", + "Biography|Drama|War 4 \n", + "Adventure|Drama|Fantasy|Romance 4 \n", + "Drama|Fantasy|Horror 4 \n", + "Drama|Family|Sport 4 \n", + "Action|Crime 4 \n", + "Adventure|Drama|Family 4 \n", + "Adventure|Drama|Western 4 \n", + "Action|Fantasy|Thriller 4 \n", + "Action|Fantasy|Horror 4 \n", + "Adventure|Drama|Sci-Fi|Thriller 4 \n", + "Drama|History|Sport 4 \n", + "Action|Comedy|Thriller 4 \n", + "Action|Adventure|History 4 \n", + "Comedy|Drama|Sci-Fi 4 \n", + "Action|Adventure|Comedy|Family|Fantasy|Sci-Fi 4 \n", + "Biography|Drama|Music|Romance 4 \n", + "Adventure|Drama|Sci-Fi 4 \n", + "Comedy|Crime|Drama|Romance 4 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Sci-Fi 4 \n", + "Biography|Documentary|Music 4 \n", + "Adventure|Comedy|Drama|Romance 4 \n", + "Drama|Horror|Sci-Fi 4 \n", + "Action|Adventure|Horror|Sci-Fi 4 \n", + "Adventure|Biography|Drama 3 \n", + "Adventure|Animation|Comedy|Family|Sport 3 \n", + "Crime|Drama|Horror|Mystery|Thriller 3 \n", + "Action|Adventure|Mystery|Sci-Fi|Thriller 3 \n", + "Action|Comedy|Fantasy 3 \n", + "Adventure 3 \n", + "Adventure|Comedy|Family|Romance 3 \n", + "Action|Crime|Fantasy|Thriller 3 \n", + "Drama|Family|Music 3 \n", + "Action|Crime|Romance|Thriller 3 \n", + "Musical|Romance 3 \n", + "Drama|Fantasy|Horror|Mystery|Thriller 3 \n", + "Drama|Music|Musical|Romance 3 \n", + "Action|Biography|Crime|Drama 3 \n", + "Biography|Comedy|Crime|Drama 3 \n", + "Drama|Fantasy|Thriller 3 \n", + "Comedy|Crime|Romance|Thriller 3 \n", + "Action|Animation|Comedy|Family|Sci-Fi 3 \n", + "Adventure|Animation|Comedy|Family|Romance 3 \n", + "Adventure|Family|Fantasy|Musical 3 \n", + "Adventure|Comedy|Fantasy 3 \n", + "Adventure|Animation|Comedy|Family|Musical 3 \n", + "Action|Adventure|Fantasy|Romance 3 \n", + 
"Action|Crime|Drama|Thriller|Western 3 \n", + "Fantasy|Horror|Mystery 3 \n", + "Action|Comedy|Sport 3 \n", + "Adventure|Comedy|Drama|Family|Fantasy 3 \n", + "Action|Adventure|Comedy|Fantasy 3 \n", + "Drama|Family|Musical|Romance 3 \n", + "Action|Biography|Drama|Sport 3 \n", + "Biography|Comedy|Drama|History 3 \n", + "Adventure|Drama|Fantasy 3 \n", + "Sci-Fi 3 \n", + "Horror|Mystery|Sci-Fi 3 \n", + "Comedy|Fantasy|Sci-Fi 3 \n", + "Drama|Horror|Mystery 3 \n", + "Drama|Fantasy|Mystery|Thriller 3 \n", + "Drama|Family|Fantasy 3 \n", + "Drama|Fantasy|Romance|Sci-Fi 3 \n", + "Comedy|Crime|Music 3 \n", + "Action|Adventure|Drama|History|Romance 3 \n", + "Comedy|Drama|Family|Fantasy 3 \n", + "Biography|Drama|Romance|Sport 3 \n", + "Documentary|Drama 3 \n", + "Adventure|Biography|Drama|Thriller 3 \n", + "Fantasy|Romance 3 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Musical 3 \n", + "Action|Romance|Thriller 3 \n", + "Comedy|War 3 \n", + "Action|Drama|Romance 3 \n", + "Comedy|Crime|Drama|Romance|Thriller 3 \n", + "Action|Adventure|Comedy|Romance|Sci-Fi 3 \n", + "Action|Adventure|Animation|Comedy|Family|Sci-Fi 3 \n", + "Action|Adventure|Horror|Sci-Fi|Thriller 3 \n", + "Comedy|Crime|Drama|Mystery|Romance 3 \n", + "Drama|Thriller|War 3 \n", + "Action|Comedy|Crime|Romance|Thriller 3 \n", + "Comedy|Mystery|Romance 2 \n", + "Animation|Comedy 2 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Romance 2 \n", + "Drama|Romance|Sci-Fi|Thriller 2 \n", + "Adventure|Comedy|Family|Fantasy|Sci-Fi 2 \n", + "Adventure|Animation|Comedy|Drama|Family|Musical 2 \n", + "Comedy|Horror|Musical 2 \n", + "Action|Adventure|Comedy|Western 2 \n", + "Comedy|Mystery 2 \n", + "Action|Comedy|Family 2 \n", + "Action|Drama|Western 2 \n", + "Action|Comedy|War 2 \n", + "Biography|Drama|History|Sport 2 \n", + "Action|Adventure|Crime|Mystery|Thriller 2 \n", + "Biography|Comedy|Drama|Sport 2 \n", + "Action|Drama|History|Romance|War 2 \n", + "Crime|Drama|Western 2 \n", + "Adventure|Comedy|Family|Sport 2 \n", + 
"Adventure|Mystery|Thriller 2 \n", + "Adventure|Comedy|Drama|Fantasy 2 \n", + "Comedy|Drama|Family|Music|Romance 2 \n", + "Adventure|Comedy|Mystery 2 \n", + "Animation 2 \n", + "Comedy|Horror|Mystery 2 \n", + "Biography|Drama|Romance|War 2 \n", + "Action|Comedy|Horror|Sci-Fi 2 \n", + "Adventure|Drama|Family|Fantasy|Sci-Fi 2 \n", + "Animation|Comedy|Family|Sci-Fi 2 \n", + "Comedy|Drama|Music|Musical 2 \n", + "Crime|Drama|Sport 2 \n", + "Adventure|Comedy|Romance 2 \n", + "Comedy|Drama|Romance|Sci-Fi 2 \n", + "Adventure|Fantasy|Mystery|Thriller 2 \n", + "Adventure|Family|Fantasy|Romance 2 \n", + "Animation|Comedy|Family|Fantasy|Music 2 \n", + "Biography|Drama|Sport|War 2 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Musical|Romance 2 \n", + "Action|Drama|Fantasy|War 2 \n", + "Adventure|Animation|Family|Sci-Fi 2 \n", + "Action|Adventure|Thriller|War 2 \n", + "Action|Comedy|Crime|Drama 2 \n", + "Comedy|Horror|Romance 2 \n", + "Drama|Horror|Romance|Thriller 2 \n", + "Animation|Family|Fantasy|Music 2 \n", + "Biography|Comedy|Drama|Romance 2 \n", + "Documentary|History|Music 2 \n", + "Drama|Horror|Mystery|Sci-Fi|Thriller 2 \n", + "Action|Adventure|Drama|Romance|War 2 \n", + "Comedy|Drama|Horror|Romance 2 \n", + "Biography|Comedy|Romance 2 \n", + "Action|Biography|Drama|History 2 \n", + "Adventure|Drama|Mystery 2 \n", + "Action|Adventure|Crime|Drama|Thriller 2 \n", + "Action|Adventure|Drama|Sci-Fi|Thriller 2 \n", + "Biography|Drama|Thriller|War 2 \n", + "Comedy|Family|Sport 2 \n", + "Fantasy 2 \n", + "Action|Drama|Fantasy|Romance 2 \n", + "Action|Adventure|Animation|Family|Fantasy|Sci-Fi 2 \n", + "Adventure|Comedy|Fantasy|Sci-Fi 2 \n", + "Action|Adventure|Animation|Family|Sci-Fi 2 \n", + "Animation|Comedy|Family|Mystery|Sci-Fi 2 \n", + "Action|Comedy|Crime|Romance 2 \n", + "Adventure|Animation|Fantasy 2 \n", + "Comedy|Drama|Thriller 2 \n", + "Action|Drama|Sci-Fi 2 \n", + "Action|Fantasy|Horror|Sci-Fi|Thriller 2 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Music 2 
\n", + "Fantasy|Horror|Sci-Fi 2 \n", + "Action|Comedy|Crime|Fantasy 2 \n", + "Animation|Family 2 \n", + "Action|Adventure|Animation|Comedy|Family|Fantasy|Sci-Fi 2 \n", + "Biography|Drama|History|Thriller 2 \n", + "Action|Drama|Fantasy|Mystery|Thriller 2 \n", + "Comedy|Drama|Family|Fantasy|Romance 2 \n", + "Animation|Comedy|Drama 2 \n", + "Action|Comedy|Documentary 2 \n", + "Action|Adventure|Drama|History 2 \n", + "Crime|Drama|Music 2 \n", + "Adventure|Drama|War 2 \n", + "Action|Comedy|Romance|Thriller 2 \n", + "Comedy|Fantasy|Horror|Romance 2 \n", + "Biography|Crime|Drama|Romance 2 \n", + "Crime|Romance|Thriller 2 \n", + "Adventure|Animation|Comedy 2 \n", + "Action|Adventure|Animation|Comedy|Drama|Family|Sci-Fi 2 \n", + "Action|Fantasy 2 \n", + "Comedy|Romance|Thriller 2 \n", + "Crime|Documentary 2 \n", + "Action|Adventure|Family|Sci-Fi 2 \n", + "Adventure|Horror 2 \n", + "Comedy|Drama|Musical 2 \n", + "Action|Adventure|Comedy|Romance 2 \n", + "Action|Comedy|Family|Fantasy 2 \n", + "Action|Drama|Mystery|Sci-Fi 2 \n", + "Drama|Fantasy|Musical|Romance 2 \n", + "Comedy|Drama|Horror|Sci-Fi 2 \n", + "Action|Adventure|Mystery|Sci-Fi 2 \n", + "Action|Crime|Mystery|Romance|Thriller 2 \n", + "Comedy|Crime|Family 2 \n", + "Mystery|Romance|Thriller 2 \n", + "Drama|Fantasy|Horror|Mystery 2 \n", + "Action|Drama|Fantasy 2 \n", + "Crime|Documentary|War 2 \n", + "Action|Crime|Drama|Sci-Fi|Thriller 2 \n", + "Comedy|Drama|Romance|Thriller 2 \n", + "Documentary|History 2 \n", + "Animation|Family|Fantasy|Musical 2 \n", + "Action|Drama|Family|Sport 2 \n", + "Adventure|Comedy|Family|Musical 2 \n", + "Action|Drama|Horror|Thriller 2 \n", + "Biography|Crime|Drama|History 2 \n", + "Action|Adventure|Horror|Thriller 2 \n", + "Family|Sci-Fi 2 \n", + "Animation|Comedy|Family|Fantasy|Musical 2 \n", + "Action|Sci-Fi|Sport 2 \n", + "Action|Adventure|Drama|Horror|Sci-Fi 2 \n", + "Action|Adventure|Animation|Family|Fantasy 2 \n", + "Adventure|Animation|Comedy|Drama|Family 2 \n", + 
"Biography|Documentary 2 \n", + "Action|Comedy|Sci-Fi|Thriller 2 \n", + "Action|Crime|Sport|Thriller 2 \n", + "Action|Comedy|Drama|Thriller 2 \n", + "Drama|Mystery|Romance|War 2 \n", + "Drama|History|War|Western 2 \n", + "Drama|Romance|War|Western 2 \n", + "Adventure|Comedy|Family|Fantasy|Horror 2 \n", + "Adventure|Animation|Comedy|Drama|Family|Fantasy|Musical 1 \n", + "History 1 \n", + "Adventure|Animation|Comedy|Fantasy|Romance 1 \n", + "Animation|Comedy|Family|Musical 1 \n", + "Game-Show|Reality-TV|Romance 1 \n", + "Adventure|Comedy|Drama|Fantasy|Romance 1 \n", + "Adventure|Fantasy|Horror|Mystery|Thriller 1 \n", + "Comedy|Romance|Sci-Fi|Thriller 1 \n", + "Comedy|Horror|Mystery|Thriller 1 \n", + "Adventure|Comedy|History|Romance 1 \n", + "Biography|Comedy|Drama|Music 1 \n", + "Comedy|Drama|Music|Musical|Romance 1 \n", + "Action|Adventure|Animation|Comedy|Fantasy|Sci-Fi 1 \n", + "Adventure|Biography|Drama|Romance 1 \n", + "Adventure|Animation|Drama|Family|Musical 1 \n", + "Drama|Fantasy|Horror|Romance 1 \n", + "Biography|Crime|Drama|Western 1 \n", + "Adventure|Family|Fantasy|Horror|Mystery 1 \n", + "Comedy|Mystery|Sci-Fi|Thriller 1 \n", + "Adventure|Animation|Fantasy|Horror|Sci-Fi 1 \n", + "Comedy|Crime|Drama|Horror|Mystery|Thriller 1 \n", + "Action|Drama|Fantasy|Sci-Fi 1 \n", + "Action|Biography|Drama|History|War 1 \n", + "Comedy|Drama|Mystery|Romance|Thriller 1 \n", + "Drama|Mystery|Romance|Thriller|War 1 \n", + "Adventure|Animation|Family|Musical 1 \n", + "Action|Crime|Drama|Western 1 \n", + "Adventure|Drama|Thriller|Western 1 \n", + "Action|Animation|Comedy|Sci-Fi 1 \n", + "Adventure|Drama|Family|Romance|Western 1 \n", + "Romance|Short 1 \n", + "Adventure|Animation|Comedy|Crime|Family 1 \n", + "Adventure|Fantasy|Mystery 1 \n", + "Drama|Family|Music|Musical 1 \n", + "Romance|Sci-Fi|Thriller 1 \n", + "Drama|Music|Mystery|Romance 1 \n", + "Adventure|Drama|History|War 1 \n", + "Comedy|Fantasy|Thriller 1 \n", + "Adventure|Comedy|Family|Fantasy|Horror|Mystery 1 \n", 
+ "Action|Drama|History|Thriller 1 \n", + "Animation|Comedy|Family|Horror|Sci-Fi 1 \n", + "Biography|Crime|Documentary|History 1 \n", + "Adventure|Animation|Drama|Family|History|Musical|Romance 1 \n", + "Thriller|Western 1 \n", + "Comedy|Drama|Family|Musical 1 \n", + "Comedy|Crime|Drama|Thriller|War 1 \n", + "Animation|Comedy|Family|Romance 1 \n", + "Comedy|Family|Fantasy|Musical 1 \n", + "Comedy|Documentary|Drama 1 \n", + "Adventure|Comedy|Crime|Family|Mystery 1 \n", + "Action|Drama|History|Thriller|War 1 \n", + "Comedy|Crime|Musical 1 \n", + "Animation|Drama|Family|Fantasy 1 \n", + "Action|Adventure|Drama|Thriller|Western 1 \n", + "Crime|Drama|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Drama|Romance|Thriller 1 \n", + "Action|Comedy|Drama|Family|Thriller 1 \n", + "Action|Adventure|Drama|Romance|Western 1 \n", + "Comedy|Drama|Romance|War 1 \n", + "Biography|Crime|Drama|Romance|Thriller 1 \n", + "Adventure|Comedy|Crime|Drama|Family 1 \n", + "Comedy|Crime|Family|Sci-Fi 1 \n", + "Drama|Mystery|War 1 \n", + "Action|Adventure|Biography|Drama|History 1 \n", + "Action|Adventure|Family|Thriller 1 \n", + "Drama|Music|Musical 1 \n", + "Comedy|Crime|Musical|Romance 1 \n", + "Crime|Drama|Fantasy|Romance 1 \n", + "Action|Adventure|Crime|Fantasy|Mystery|Thriller 1 \n", + "Drama|Fantasy|War 1 \n", + "Action|Animation|Fantasy|Horror|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Fantasy|War 1 \n", + "Comedy|Drama|History|Romance 1 \n", + "Action|Adventure|Romance|War 1 \n", + "Fantasy|Mystery|Romance|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Comedy|Family|Romance|Sci-Fi 1 \n", + "Comedy|History 1 \n", + "Adventure|Comedy|Family|Romance|Sci-Fi 1 \n", + "Adventure|Animation|Family|Fantasy|Musical 1 \n", + "Comedy|Crime|Sport 1 \n", + "Thriller|War 1 \n", + "Drama|Music|Romance|War 1 \n", + "Biography|Crime|Drama|History|Romance 1 \n", + "Comedy|Mystery|Thriller 1 \n", + "Biography|Crime|Drama|Music 1 \n", + "Action|Crime|Drama|Sport 1 \n", + "Drama|Fantasy|Romance|Thriller 1 
\n", + "Drama|Film-Noir|Mystery|Thriller 1 \n", + "Action|Comedy|Drama 1 \n", + "Drama|War|Western 1 \n", + "Film-Noir|Mystery|Romance|Thriller 1 \n", + "Action|Horror|Mystery|Sci-Fi|Thriller 1 \n", + "Adventure|Crime|Drama|Romance 1 \n", + "Biography|Comedy|Drama|Music|Romance 1 \n", + "Drama|Music|Mystery|Romance|Sci-Fi 1 \n", + "Biography|Documentary|Sport 1 \n", + "Adventure|Animation|Comedy|Family|Western 1 \n", + "Action|Comedy|Crime|Music 1 \n", + "Action|Adventure|Comedy|Crime|Family|Romance|Thriller 1 \n", + "Action|Comedy|Drama|Music 1 \n", + "Animation|Biography|Documentary|Drama|History|War 1 \n", + "Fantasy|Horror|Romance|Thriller 1 \n", + "Action|Drama|Romance|Sci-Fi|Thriller 1 \n", + "Action|Comedy|Crime|Fantasy|Horror|Mystery|Sci-Fi|Thriller 1 \n", + "Comedy|Crime|Family|Mystery|Romance|Thriller 1 \n", + "Adventure|Comedy|Romance|Sci-Fi 1 \n", + "Comedy|Drama|Horror 1 \n", + "Crime|Drama|History|Romance 1 \n", + "Action|Crime|Drama|Thriller|War 1 \n", + "Action|Crime|Drama|History|Western 1 \n", + "Adventure|Biography|Drama|History|Sport|Thriller 1 \n", + "Comedy|Drama|Fantasy|Horror 1 \n", + "Adventure|Animation|Family|Sport 1 \n", + "Action|Adventure|Drama|Mystery 1 \n", + "Animation|Comedy|Fantasy 1 \n", + "Crime|Film-Noir|Thriller 1 \n", + "Documentary|Drama|War 1 \n", + "Adventure|Crime|Drama|Mystery|Western 1 \n", + "Animation|Comedy|Fantasy|Musical 1 \n", + "Action|Adventure|Comedy|Family|Fantasy|Mystery|Sci-Fi 1 \n", + "Biography|Comedy|Drama|History|Music|Musical 1 \n", + "Action|Adventure|Drama|Thriller|War 1 \n", + "Adventure|Comedy|Sport 1 \n", + "Biography|Drama|History|Music 1 \n", + "Comedy|Family|Music|Musical 1 \n", + "Animation|Comedy|Family|Music|Western 1 \n", + "Drama|Fantasy|Sci-Fi 1 \n", + "Action|Biography|Drama|History|Romance|Western 1 \n", + "Biography|Crime|Drama|History|Thriller 1 \n", + "Action|Adventure|Comedy|Music|Thriller 1 \n", + "Biography|Drama|Fantasy|History 1 \n", + "Animation|Family|Fantasy 1 \n", + 
"Drama|Fantasy|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Comedy|Family|Romance 1 \n", + "Action|Drama|Fantasy|Horror|War 1 \n", + "Comedy|Drama|Romance|Western 1 \n", + "Animation|Drama|Family|Fantasy|Musical|Romance 1 \n", + "Action|Fantasy|Romance|Sci-Fi 1 \n", + "Adventure|Drama|History|Romance 1 \n", + "Action|Biography|Drama 1 \n", + "Action|Adventure|Comedy|Drama|Thriller 1 \n", + "Comedy|Short 1 \n", + "Action|Adventure|Comedy|Crime|Mystery|Thriller 1 \n", + "Adventure|Comedy|Drama|Romance|Sci-Fi 1 \n", + "Adventure|Comedy|Family|Mystery|Sci-Fi 1 \n", + "Action|Adventure|Comedy|Sci-Fi|Thriller 1 \n", + "Action|Drama|Fantasy|Thriller|Western 1 \n", + "Biography|Comedy|Drama|Family|Sport 1 \n", + "Action|Adventure|Crime|Drama|Mystery|Thriller 1 \n", + "Action|Animation|Comedy|Family|Fantasy|Sci-Fi 1 \n", + "Action|Adventure|Comedy|Family|Mystery 1 \n", + "Adventure|Family|Romance 1 \n", + "Adventure|Comedy|Fantasy|Music|Sci-Fi 1 \n", + "Drama|Musical|Romance|Thriller 1 \n", + "Crime|Documentary|News 1 \n", + "Comedy|Drama|Reality-TV|Romance 1 \n", + "Action|Drama|Fantasy|Horror|Thriller 1 \n", + "Drama|History|Music|Romance|War 1 \n", + "Action|Crime|Horror|Sci-Fi|Thriller 1 \n", + "Comedy|Family|Musical|Romance 1 \n", + "Action|Comedy|Horror|Thriller 1 \n", + "Comedy|Family|Romance|Sci-Fi 1 \n", + "Action|Adventure|Romance|Thriller 1 \n", + "Animation|Drama|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Family|Fantasy|Musical 1 \n", + "Adventure|Crime|Drama 1 \n", + "Action|Adventure|Animation|Drama|Mystery|Sci-Fi|Thriller 1 \n", + "Comedy|Drama|Mystery|Romance|Thriller|War 1 \n", + "Drama|Horror|Romance 1 \n", + "Action|Sci-Fi|War 1 \n", + "Action|Drama|Romance|Thriller 1 \n", + "Action|Comedy|Drama|Western 1 \n", + "Crime|Horror|Music|Thriller 1 \n", + "Documentary|Drama|Sport 1 \n", + "Family|Fantasy|Musical 1 \n", + "Biography|Crime|Documentary|History|Thriller 1 \n", + "Adventure|Drama|History|Romance|War 1 \n", + "Horror|Musical 1 \n", + 
"Horror|Musical|Sci-Fi 1 \n", + "Animation|Biography|Drama|War 1 \n", + "Action|Adventure|Fantasy|Horror|Sci-Fi|Thriller 1 \n", + "Comedy|Crime|Drama|Horror|Thriller 1 \n", + "Comedy|Sci-Fi|Thriller 1 \n", + "Comedy|Drama|Music|War 1 \n", + "Crime|Drama|Horror 1 \n", + "Drama|History|Horror 1 \n", + "Crime|Drama|Mystery|Romance|Thriller 1 \n", + "Drama|Fantasy|Romance|War 1 \n", + "Adventure|Animation|Family|Thriller 1 \n", + "Adventure|Horror|Mystery 1 \n", + "Mystery|Romance|Sci-Fi|Thriller 1 \n", + "Documentary|History|Sport 1 \n", + "Crime|Documentary|Drama 1 \n", + "Comedy|Thriller 1 \n", + "Action|Fantasy|Horror|Sci-Fi 1 \n", + "Adventure|Drama|Romance|Western 1 \n", + "Action|Adventure|Fantasy|Horror|Sci-Fi 1 \n", + "Action|Animation|Comedy|Family|Fantasy 1 \n", + "Adventure|Comedy|Western 1 \n", + "Action|Thriller|Western 1 \n", + "Action|Crime|Horror|Thriller 1 \n", + "Comedy|Crime|Family|Romance 1 \n", + "Crime|Drama|Music|Romance 1 \n", + "Drama|Family|Music|Romance 1 \n", + "Adventure|Comedy|Drama|Family|Sport 1 \n", + "Adventure|Documentary 1 \n", + "Biography|Comedy|Drama|Family|Romance 1 \n", + "Action|Adventure|Animation|Comedy|Sci-Fi 1 \n", + "Horror|Romance|Sci-Fi 1 \n", + "Action|Adventure|Romance|Western 1 \n", + "Action|Adventure|Animation|Comedy|Crime|Family|Fantasy 1 \n", + "Adventure|Comedy|Family|Fantasy|Musical 1 \n", + "Comedy|Drama|Musical|Romance|War 1 \n", + "Action|Adventure|Comedy|Family 1 \n", + "Biography|Crime|Drama|History|Music 1 \n", + "Action|Adventure|Comedy|Drama|War 1 \n", + "Action|Adventure|Fantasy|Horror|Thriller 1 \n", + "Action|Adventure|Drama|Fantasy|Sci-Fi 1 \n", + "Action|Adventure|Comedy|Fantasy|Romance 1 \n", + "Adventure|Comedy|Family|Fantasy|Music|Sci-Fi 1 \n", + "Comedy|Documentary|Music 1 \n", + "Adventure|Animation|Drama|Family|Fantasy|Musical|Mystery|Romance 1 \n", + "Musical 1 \n", + "Adventure|Comedy|Family|Sci-Fi 1 \n", + "Adventure|Comedy|Music|Sci-Fi 1 \n", + "Family|Music|Romance 1 \n", + 
"Action|Crime|Fantasy|Romance|Thriller 1 \n", + "Comedy|Family|Fantasy|Sci-Fi 1 \n", + "Family|Musical 1 \n", + "Action|Comedy|Sci-Fi|Western 1 \n", + "Adventure|Drama|Mystery|Sci-Fi|Thriller 1 \n", + "Adventure|Animation|Comedy|Fantasy 1 \n", + "Comedy|Fantasy|Horror|Musical 1 \n", + "Action|Animation|Comedy|Crime|Family 1 \n", + "Comedy|Drama|Musical|Romance|Western 1 \n", + "Comedy|Crime|Mystery|Romance 1 \n", + "Action|Comedy|Crime|Family 1 \n", + "Action|Horror|Romance|Sci-Fi|Thriller 1 \n", + "Action|Western 1 \n", + "Biography|Crime|Drama|War 1 \n", + "Crime|Drama|Mystery|Sci-Fi|Thriller 1 \n", + "Adventure|Comedy|History 1 \n", + "Comedy|Family|Music 1 \n", + "Comedy|Crime|Drama|Mystery|Thriller 1 \n", + "Adventure|Crime|Thriller 1 \n", + "Crime|Horror 1 \n", + "Action|Adventure|Comedy|Fantasy|Sci-Fi 1 \n", + "Comedy|Horror|Musical|Sci-Fi 1 \n", + "Adventure|Drama|Fantasy|Thriller|Western 1 \n", + "Drama|Family|History|Musical 1 \n", + "Action|Crime|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Drama|Music|Romance 1 \n", + "Adventure|Biography 1 \n", + "Action|Adventure|Comedy|Fantasy|Thriller 1 \n", + "Adventure|Biography|Drama|Horror|Thriller 1 \n", + "Action|Adventure|Crime|Drama 1 \n", + "Action|Adventure|Crime|Drama|Romance 1 \n", + "Comedy|Crime|Horror|Thriller 1 \n", + "Action|Drama|History 1 \n", + "Action|Adventure|Comedy|Crime|Music|Mystery 1 \n", + "Fantasy|Horror|Mystery|Romance 1 \n", + "Drama|History|Romance 1 \n", + "Crime|Drama|Mystery|Thriller|Western 1 \n", + "Action|Comedy|Drama|Sci-Fi 1 \n", + "Action|Adventure|Drama|War 1 \n", + "Adventure|Comedy|Crime 1 \n", + "Crime|Drama|Film-Noir|Mystery|Thriller 1 \n", + "Action|Animation|Sci-Fi|Thriller 1 \n", + "Comedy|Crime|Drama|Sci-Fi 1 \n", + "Action|Comedy|Crime|Music|Romance|Thriller 1 \n", + "Adventure|Comedy|Drama|Family|Mystery 1 \n", + "Adventure|Animation 1 \n", + "Action|Adventure|Comedy|Romance|Thriller|Western 1 \n", + "Drama|Family|Musical 1 \n", + "Drama|Musical|Sci-Fi 1 \n", + 
"Action|Adventure|Family|Fantasy|Sci-Fi 1 \n", + "Adventure|Crime|Drama|Thriller 1 \n", + "Action|Adventure|Animation|Drama|Fantasy|Sci-Fi 1 \n", + "Comedy|Family|Fantasy|Sport 1 \n", + "Biography|Drama|History|Thriller|War 1 \n", + "Action|Fantasy|Sci-Fi|Thriller 1 \n", + "Adventure|Animation|Drama|Family|Fantasy 1 \n", + "Adventure|Animation|Comedy|Drama|Family|Fantasy|Sci-Fi 1 \n", + "Drama|Film-Noir 1 \n", + "Drama|History|Romance|Western 1 \n", + "Comedy|Crime|Sci-Fi|Thriller 1 \n", + "Comedy|Family|Musical|Romance|Short 1 \n", + "Crime|Drama|Musical|Romance|Thriller 1 \n", + "Action|Crime|Drama|Mystery 1 \n", + "Action|Adventure|Drama|Western 1 \n", + "Animation|Drama|Family 1 \n", + "Comedy|Family|Music|Romance 1 \n", + "Action|Drama|Sci-Fi|Sport 1 \n", + "Adventure|Biography|Documentary|Drama 1 \n", + "Horror|Sci-Fi|Short|Thriller 1 \n", + "Action|Adventure|Crime|Drama|Family|Fantasy|Romance|Thriller 1 \n", + "Adventure|Animation|Comedy|Drama|Family|Fantasy|Romance 1 \n", + "Comedy|Documentary|War 1 \n", + "Action|Biography|Crime|Drama|Thriller 1 \n", + "Drama|Fantasy|Music|Romance 1 \n", + "Action|Adventure|Animation|Fantasy 1 \n", + "Adventure|Biography|Drama|War 1 \n", + "Comedy|Drama|Fantasy|Music|Romance 1 \n", + "Biography|Comedy|Crime|Drama|Romance|Thriller 1 \n", + "Drama|Mystery|Romance|Sci-Fi|Thriller 1 \n", + "Comedy|Drama|Horror|Sci-Fi|Thriller 1 \n", + "Action|Drama|Mystery|Thriller|War 1 \n", + "Action|Adventure|Family|Mystery|Sci-Fi 1 \n", + "Action|Adventure|Family|Sci-Fi|Thriller 1 \n", + "Comedy|Fantasy|Musical|Sci-Fi 1 \n", + "Drama|Fantasy|Mystery|Romance 1 \n", + "Action|Adventure|Crime 1 \n", + "Animation|Family|Fantasy|Musical|Romance 1 \n", + "Action|Comedy|Drama|War 1 \n", + "Adventure|Comedy|Drama|Family|Romance 1 \n", + "Comedy|Family|Fantasy|Music|Romance 1 \n", + "Comedy|Family|Fantasy|Horror|Mystery 1 \n", + "Fantasy|Mystery|Thriller 1 \n", + "Adventure|Documentary|Short 1 \n", + "Action|Biography|Documentary|Sport 1 \n", + 
"Crime|Drama|History 1 \n", + "Comedy|Crime|Drama|Music|Romance 1 \n", + "Adventure|Comedy|Crime|Family|Musical 1 \n", + "Action|Adventure|Comedy|Romance|Thriller 1 \n", + "Action|Adventure|Comedy|Musical 1 \n", + "Adventure|Crime|Drama|Mystery|Thriller 1 \n", + "Adventure|Drama|Fantasy|Mystery 1 \n", + "Crime|Drama|Musical 1 \n", + "Crime|Drama|Film-Noir 1 \n", + "Action|Adventure|Comedy|Fantasy|Mystery 1 \n", + "Adventure|Drama|History|Thriller|War 1 \n", + "Drama|Family|Western 1 \n", + "Documentary|Family 1 \n", + "Biography|Drama|Family|Musical|Romance 1 \n", + "Action|Fantasy|Western 1 \n", + "Animation|Drama 1 \n", + "Action|Drama|Mystery|Thriller 1 \n", + "Biography|Drama|Romance|Western 1 \n", + "Action|Crime|Sci-Fi 1 \n", + "Action|Comedy|Fantasy|Horror 1 \n", + "Action|Comedy|Crime|Sci-Fi|Thriller 1 \n", + "Animation|Comedy|Crime|Drama|Family 1 \n", + "Comedy|Drama|History|Musical|Romance 1 \n", + "Adventure|Comedy|Drama|Fantasy|Musical 1 \n", + "Action|Animation|Crime|Sci-Fi|Thriller 1 \n", + "Biography|Comedy|Drama|War 1 \n", + "Adventure|Documentary|Drama|Sport 1 \n", + "Adventure|Animation|Sci-Fi 1 \n", + "Adventure|Animation|Biography|Drama|Family|Fantasy|Musical 1 \n", + "Adventure|Comedy|Crime|Drama 1 \n", + "Biography|Crime|Drama|History|Western 1 \n", + "Action|Drama|History|Romance|War|Western 1 \n", + "Adventure|Animation|Family|Western 1 \n", + "Adventure|Sci-Fi 1 \n", + "Adventure|Comedy|Drama|Family 1 \n", + "Crime|Drama|Fantasy 1 \n", + "Animation|Comedy|Drama|Fantasy|Sci-Fi 1 \n", + "Crime|Thriller|War 1 \n", + "Crime|Fantasy|Horror 1 \n", + "Action|Crime|Drama|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Comedy|Crime|Drama|Romance|Thriller 1 \n", + "Adventure|Family|Fantasy|Sci-Fi 1 \n", + "Action|Adventure|Animation|Comedy|Fantasy 1 \n", + "Animation|Comedy|Family|Sport 1 \n", + "Action|Comedy|Fantasy|Romance 1 \n", + "Adventure|Family|Fantasy|Music|Musical 1 \n", + "Adventure|Drama|Horror|Mystery|Thriller 1 \n", + 
"Action|Horror|Mystery|Thriller 1 \n", + "Action|Family|Sport 1 \n", + "Biography|Drama|Family|History|Sport 1 \n", + "Action|Biography|Drama|Thriller|War 1 \n", + "Comedy|Family|Romance|Sport 1 \n", + "Action|Adventure|Drama|Romance|Sci-Fi 1 \n", + "Adventure|Family|Sci-Fi 1 \n", + "Adventure|Horror|Sci-Fi 1 \n", + "Drama|Fantasy|Sport 1 \n", + "Biography|Documentary|History 1 \n", + "Action|Adventure|Comedy|Drama|Music|Sci-Fi 1 \n", + "Crime|Drama|Mystery|Romance 1 \n", + "Drama|Fantasy|Mystery|Sci-Fi 1 \n", + "Adventure|Drama|History|Romance|Thriller|War 1 \n", + "Action|War 1 \n", + "Action|Drama|Fantasy|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Family|Fantasy|Romance 1 \n", + "Action|Adventure|Family|Mystery 1 \n", + "Action|Adventure|Drama|Family 1 \n", + "Biography 1 \n", + "Action|Drama|Romance|War 1 \n", + "Fantasy|Thriller 1 \n", + "Action|Adventure|Fantasy|Horror 1 \n", + "Documentary|News 1 \n", + "Action|Comedy|Sci-Fi|Sport 1 \n", + "Action|Adventure|Family|Fantasy|Sci-Fi|Thriller 1 \n", + "Crime|Drama|Music|Mystery|Thriller 1 \n", + "Adventure|War|Western 1 \n", + "Drama|Horror|Mystery|Sci-Fi 1 \n", + "Drama|Music|Mystery|Romance|Thriller 1 \n", + "Action|Adventure|Drama|Fantasy|War 1 \n", + "Action|Adventure|Animation|Family 1 \n", + "Adventure|Comedy|Horror|Sci-Fi 1 \n", + "Action|Fantasy|Horror|Mystery 1 \n", + "Adventure|Comedy|Family|Fantasy|Romance|Sport 1 \n", + "Action|Adventure|Crime|Drama|Sci-Fi|Thriller 1 \n", + "Action|Biography|Drama|History|Thriller|War 1 \n", + "Adventure|Biography|Crime|Drama|Western 1 \n", + "Action|Sport 1 \n", + "Comedy|Fantasy|Mystery 1 \n", + "Biography|Drama|Family|Sport 1 \n", + "Comedy|Music|Sci-Fi 1 \n", + "Documentary|Drama|History|News 1 \n", + "Mystery|Western 1 \n", + "Action|Adventure|Mystery|Romance|Thriller 1 \n", + "Comedy|Horror|Sci-Fi|Thriller 1 \n", + "Comedy|Crime|Drama|Mystery 1 \n", + "Adventure|Animation|Comedy|Family|Fantasy|Sci-Fi|Sport 1 \n", + "Action|Drama|Romance|Sport 1 \n", + 
"Animation|Family|Fantasy|Mystery 1 \n", + "Action|Animation|Sci-Fi 1 \n", + "Action|Adventure|History|Western 1 \n", + "Adventure|Drama|Horror|Thriller 1 \n", + "Documentary|Family|Music 1 \n", + "Biography|Documentary|Drama 1 \n", + "Adventure|Animation|Comedy|Drama|Family|Fantasy 1 \n", + "Biography|Drama|History|Musical 1 \n", + "Action|Fantasy|Horror|Mystery|Thriller 1 \n", + "Action|Adventure|Animation|Family|Sci-Fi|Thriller 1 \n", + "Action|Adventure|Animation|Fantasy|Romance|Sci-Fi 1 \n", + "Action|Biography|Drama|History|Romance|War 1 \n", + "Adventure|Animation|Comedy|Family|War 1 \n", + "Comedy|Documentary|Drama|Fantasy|Mystery|Sci-Fi 1 \n", + "Adventure|Drama|Fantasy|Mystery|Thriller 1 \n", + "Animation|Comedy|Family|Music|Romance 1 \n", + "Action|Comedy|Mystery 1 \n", + "Animation|Comedy|Drama|Family|Musical 1 \n", + "Adventure|Biography|Drama|History 1 \n", + "Drama|Fantasy|Mystery|Romance|Thriller 1 \n", + "Crime|Drama|Music|Thriller 1 \n", + "Adventure|Comedy|Crime|Romance 1 \n", + "Action|Biography|Crime|Drama|Family|Fantasy 1 \n", + "Action|Romance|Sport 1 \n", + "Biography|Comedy|Drama|History|Music 1 \n", + "Animation|Drama|Family|Musical|Romance 1 \n", + "Action|Adventure|Family|Fantasy|Thriller 1 \n", + "Biography|Drama|Family 1 \n", + "Fantasy|Horror|Romance 1 \n", + "Action|Adventure|Comedy|Family|Fantasy 1 \n", + "Horror|Romance|Thriller 1 \n", + "Comedy|Drama|Family|Fantasy|Musical 1 \n", + "Biography|Comedy|Musical|Romance|Western 1 \n", + "Animation|Comedy|Family|Fantasy|Musical|Romance 1 \n", + "Animation|Comedy|Family|Fantasy|Sci-Fi 1 \n", + "Adventure|Fantasy|Thriller 1 \n", + "Adventure|Family|Sport 1 \n", + "Adventure|Crime|Mystery|Sci-Fi|Thriller 1 \n", + "Action|Adventure|History|Romance 1 \n", + "Animation|Comedy|Family|Fantasy|Mystery 1 \n", + "Action|Animation|Comedy|Family 1 \n", + "Action|Adventure|Animation|Comedy|Drama|Family|Fantasy|Thriller 1 \n", + "Adventure|Crime|Drama|Western 1 \n", + 
"Action|Adventure|Comedy|Crime|Thriller 1 \n", + "Music 1 \n", + "Action|Comedy|Music 1 \n", + "Adventure|Drama|Family|Mystery 1 \n", + "Biography|Comedy|Musical 1 \n", + "Adventure|Comedy|Horror 1 \n", + "Adventure|Animation|Comedy|Crime 1 \n", + "Biography|Comedy|Documentary 1 \n", + "Action|Comedy|Mystery|Romance 1 \n", + "Action|Drama|Sport|Thriller 1 \n", + "Animation|Comedy|Drama|Romance 1 \n", + "Comedy|Fantasy|Horror|Mystery 1 \n", + "Crime|Drama|History|Mystery|Thriller 1 \n", + "Action|Horror|Romance 1 \n", + "Adventure|Comedy|Crime|Music 1 \n", + "Crime|Drama|Musical|Romance 1 \n", + "Adventure|Comedy|Sci-Fi|Western 1 \n", + "Crime|Drama|Fantasy|Mystery 1 \n", + "Action|Adventure|Drama|History|Thriller|War 1 \n", + "Action|Adventure|Biography|Drama|History|Thriller 1 \n", + "Comedy|Crime|Horror 1 \n", + "Adventure|Animation|Family|Fantasy|Musical|War 1 \n", + "Action|Adventure|Biography|Drama|History|Romance|War 1 \n", + "Comedy|Drama|Family|Fantasy|Sci-Fi 1 \n", + "Comedy|Crime|Musical|Mystery 1 \n", + "Adventure|Comedy|Drama|Romance|Thriller|War 1 \n", + "Adventure|Comedy|Family|Music|Romance 1 \n", + "Action|Comedy|Crime|Western 1 \n", + "Adventure|Drama|Thriller|War 1 \n", + "Biography|Crime|Drama|Mystery|Thriller 1 \n", + "Adventure|Comedy|Drama|Music 1 \n", + "Adventure|Animation|Comedy|Fantasy|Music|Romance 1 \n", + "Family|Fantasy|Music 1 \n", + "Action|Adventure|Drama|History|Romance|War 1 \n", + "Biography|Comedy|Crime|Drama|Romance 1 \n", + "Adventure|Comedy|Musical|Romance 1 \n", + "Name: genres, dtype: int64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.genres.value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #3: Show all columns and column width" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "pd.reset_option('display.width')\n", + 
"pd.reset_option('display.max_columns')\n", + "pd.reset_option('display.max_colwidth')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenres...num_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-Fi...3054.0EnglishUSAPG-13237000000.02009.0936.07.91.7833000
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|Fantasy...1238.0EnglishUSAPG-13300000000.02007.05000.07.12.350
\n", + "

2 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "0 3054.0 English USA PG-13 237000000.0 \n", + "1 1238.0 English USA PG-13 300000000.0 \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "0 2009.0 936.0 7.9 1.78 \n", + "1 2007.0 5000.0 7.1 2.35 \n", + "\n", + " movie_facebook_likes \n", + "0 33000 \n", + "1 0 \n", + "\n", + "[2 rows x 28 columns]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# show all columns on wider area\n", + "import pandas as pd\n", + "pd.set_option('display.width', None)\n", + "pd.set_option('display.max_columns', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
colordirector_namenum_critic_for_reviewsdurationdirector_facebook_likesactor_3_facebook_likesactor_2_nameactor_1_facebook_likesgrossgenresactor_1_namemovie_titlenum_voted_userscast_total_facebook_likesactor_3_namefacenumber_in_posterplot_keywordsmovie_imdb_linknum_user_for_reviewslanguagecountrycontent_ratingbudgettitle_yearactor_2_facebook_likesimdb_scoreaspect_ratiomovie_facebook_likes
0ColorJames Cameron723.0178.00.0855.0Joel David Moore1000.0760505847.0Action|Adventure|Fantasy|Sci-FiCCH PounderAvatar8862044834Wes Studi0.0avatar|future|marine|native|paraplegichttp://www.imdb.com/title/tt0499549/?ref_=fn_t...3054.0EnglishUSAPG-13237000000.02009.0936.07.91.7833000
1ColorGore Verbinski302.0169.0563.01000.0Orlando Bloom40000.0309404152.0Action|Adventure|FantasyJohnny DeppPirates of the Caribbean: At World's End47122048350Jack Davenport0.0goddess|marriage ceremony|marriage proposal|pi...http://www.imdb.com/title/tt0449088/?ref_=fn_t...1238.0EnglishUSAPG-13300000000.02007.05000.07.12.350
\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "\n", + " actor_1_facebook_likes gross genres \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy \n", + "\n", + " actor_1_name movie_title num_voted_users \\\n", + "0 CCH Pounder Avatar  886204 \n", + "1 Johnny Depp Pirates of the Caribbean: At World's End  471220 \n", + "\n", + " cast_total_facebook_likes actor_3_name facenumber_in_poster \\\n", + "0 4834 Wes Studi 0.0 \n", + "1 48350 Jack Davenport 0.0 \n", + "\n", + " plot_keywords \\\n", + "0 avatar|future|marine|native|paraplegic \n", + "1 goddess|marriage ceremony|marriage proposal|pi... \n", + "\n", + " movie_imdb_link num_user_for_reviews \\\n", + "0 http://www.imdb.com/title/tt0499549/?ref_=fn_t... 3054.0 \n", + "1 http://www.imdb.com/title/tt0449088/?ref_=fn_t... 
1238.0 \n", + "\n", + " language country content_rating budget title_year \\\n", + "0 English USA PG-13 237000000.0 2009.0 \n", + "1 English USA PG-13 300000000.0 2007.0 \n", + "\n", + " actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes \n", + "0 936.0 7.9 1.78 33000 \n", + "1 5000.0 7.1 2.35 0 " + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "5 Action|Adventure|Sci-Fi\n", + "6 Action|Adventure|Romance\n", + "7 Adventure|Animation|Comedy|Family|Fantasy|Musi...\n", + "8 Action|Adventure|Sci-Fi\n", + "9 Adventure|Family|Fantasy|Mystery\n", + "Name: genres, dtype: object" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.iloc[5:10,9]" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "# display column values without truncation\n", + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "5 Action|Adventure|Sci-Fi\n", + "6 Action|Adventure|Romance\n", + "7 Adventure|Animation|Comedy|Family|Fantasy|Musical|Romance\n", + "8 Action|Adventure|Sci-Fi\n", + "9 Adventure|Family|Fantasy|Mystery\n", + "Name: genres, dtype: object" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.iloc[5:10,9]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step #5: Increase Jupyter Notebook cell width" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": 
"display_data" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "display(HTML(\"\"))" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "display(HTML(\"\"))\n", + "display(HTML(\"\"))\n", + "display(HTML(\"\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/How_to_Optimize_and_Speed_Up_Pandas.ipynb b/notebooks/pandas/How_to_Optimize_and_Speed_Up_Pandas.ipynb new file mode 100644 index 0000000..7325f23 --- /dev/null +++ b/notebooks/pandas/How_to_Optimize_and_Speed_Up_Pandas.ipynb @@ -0,0 +1,2373 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4 Simple ways to optimize pandas\n", + "\n", + "1. **Optimize datatypes**\n", + "2. **Use built-in functions**\n", + "3. **Search for smart alternatives**\n", + "4. **Do tests** " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Bonus tips:\n", + "1. 
**Use NumPy arrays/matrix**\n", + "\n", + "```python \n", + "# Convert the frame to its NumPy-array representation. Deprecated since version 0.23.0 and removed in pandas 1.0\n", + "numpy_matrix = df.as_matrix() \n", + "\n", + "# Return a NumPy representation of the DataFrame.\n", + "df.values\n", + "\n", + "# Convert the DataFrame to a NumPy array\n", + "df.to_numpy() \n", + "```\n", + "2. **Optimize data when you read it**\n", + "```python \n", + "pd.read_csv('foo.csv', dtype={'a': 'int'})\n", + "```\n", + "3. **Convert dates to Datetime**\n", + "\n", + "```python \n", + "df['start_date'] = pd.to_datetime(df['start_date'])\n", + "```\n", + "\n", + "4. **Loop over pandas data in a smart way ( iterrows, itertuples, zip )**\n", + "\n", + "https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas\n", + "\n", + "```python\n", + "for i,r in t.iterrows(): # 0.5639059543609619\n", + "for ir in t.itertuples(): # 0.017839908599853516\n", + "for r in zip(t['a'], t['b']): # 0.005645036697387695\n", + "``` " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(98855, 129)\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "pd.set_option(\"display.max_columns\", None) # or 1000\n", + "pd.set_option(\"display.max_rows\", None) # or 1000\n", + "pd.set_option(\"display.max_colwidth\", 500) # or 199\n", + "pd.set_option(\"display.expand_frame_repr\", True) # or False\n", + "\n", + "# read the CSV into a DataFrame and check its shape\n", + "df = pd.read_csv(\"../csv/stackoverflow/developer_survey_2018/survey_results_public.csv\", low_memory=False)\n", + "print(df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Respondent int64\n", + "Hobby object\n", + "OpenSource object\n", + "Country object\n", + "Student object\n", + "dtype: object" + ] + }, + 
"execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
RespondentHobbyOpenSourceCountryStudentEmploymentFormalEducationUndergradMajorCompanySizeDevTypeYearsCodingYearsCodingProfJobSatisfactionCareerSatisfactionHopeFiveYearsJobSearchStatusLastNewJobAssessJob1AssessJob2AssessJob3AssessJob4AssessJob5AssessJob6AssessJob7AssessJob8AssessJob9AssessJob10AssessBenefits1AssessBenefits2AssessBenefits3AssessBenefits4AssessBenefits5AssessBenefits6AssessBenefits7AssessBenefits8AssessBenefits9AssessBenefits10AssessBenefits11JobContactPriorities1JobContactPriorities2JobContactPriorities3JobContactPriorities4JobContactPriorities5JobEmailPriorities1JobEmailPriorities2JobEmailPriorities3JobEmailPriorities4JobEmailPriorities5JobEmailPriorities6JobEmailPriorities7UpdateCVCurrencySalarySalaryTypeConvertedSalaryCurrencySymbolCommunicationToolsTimeFullyProductiveEducationTypesSelfTaughtTypesTimeAfterBootcampHackathonReasonsAgreeDisagree1AgreeDisagree2AgreeDisagree3LanguageWorkedWithLanguageDesireNextYearDatabaseWorkedWithDatabaseDesireNextYearPlatformWorkedWithPlatformDesireNextYearFrameworkWorkedWithFrameworkDesireNextYearIDEOperatingSystemNumberMonitorsMethodologyVersionControlCheckInCodeAdBlockerAdBlockerDisableAdBlockerReasonsAdsAgreeDisagree1AdsAgreeDisagree2AdsAgreeDisagree3AdsActionsAdsPriorities1AdsPriorities2AdsPriorities3AdsPriorities4AdsPriorities5AdsPriorities6AdsPriorities7AIDangerousAIInterestingAIResponsibleAIFutureEthicsChoiceEthicsReportEthicsResponsibleEthicalImplicationsStackOverflowRecommendStackOverflowVisitStackOverflowHasAccountStackOverflowParticipateStackOverflowJobsStackOverflowDevStoryStackOverflowJobsRecommendStackOverflowConsiderMemberHypotheticalTools1HypotheticalTools2HypotheticalTools3HypotheticalTools4HypotheticalTools5WakeTimeHoursComputerHoursOutsideSkipMealsErgonomicDevicesExerciseGenderSexualOrientationEducationParentsRaceEthnicityAgeDependentsMilitaryUSSurveyTooLongSurveyEasy
01YesNoKenyaNoEmployed part-timeBachelor’s degree (BA, BS, B.Eng., etc.)Mathematics or statistics20 to 99 employeesFull-stack developer3-5 years3-5 yearsExtremely satisfiedExtremely satisfiedWorking as a founder or co-founder of my own companyI’m not actively looking, but I am open to new opportunitiesLess than a year ago10.07.08.01.02.05.03.04.09.06.0NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN3.01.04.02.05.05.06.07.02.01.04.03.0My job status or other personal status changedNaNNaNMonthlyNaNKESSlackOne to three monthsTaught yourself a new language, framework, or tool without taking a formal course;Participated in a hackathonThe official documentation and/or standards for the technology;A book or e-book from O’Reilly, Apress, or a similar publisher;Questions & answers on Stack Overflow;Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.)NaNTo build my professional networkStrongly agreeStrongly agreeNeither Agree nor DisagreeJavaScript;Python;HTML;CSSJavaScript;Python;HTML;CSSRedis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc)Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc)AWS;Azure;Linux;FirebaseAWS;Azure;Linux;FirebaseDjango;ReactDjango;ReactKomodo;Vim;Visual Studio CodeLinux-based1Agile;ScrumGitMultiple times per dayYesNoNaNStrongly agreeStrongly agreeStrongly agreeSaw an online advertisement and then researched it (without clicking on the ad);Stopped going to a website because of their advertising1.05.04.07.02.06.03.0Artificial intelligence surpassing human intelligence (\"the singularity\")Algorithms making important decisionsThe developers or the people creating the AII'm excited about the possibilities more than worried about the dangers.NoYes, and publiclyUpper management at the company/organizationYes10 (Very Likely)Multiple times per dayYesI have never participated in Q&A on Stack OverflowNo, I knew that Stack Overflow had a jobs 
board but have never used or visited itYesNaNYesExtremely interestedExtremely interestedExtremely interestedExtremely interestedExtremely interestedBetween 5:00 - 6:00 AM9 - 12 hours1 - 2 hoursNeverStanding desk3 - 4 times per weekMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)Black or of African descent25 - 34 years oldYesNaNThe survey was an appropriate lengthVery easy
13YesYesUnited KingdomNoEmployed full-timeBachelor’s degree (BA, BS, B.Eng., etc.)A natural science (ex. biology, chemistry, physics)10,000 or more employeesDatabase administrator;DevOps specialist;Full-stack developer;System administrator30 or more years18-20 yearsModerately dissatisfiedNeither satisfied nor dissatisfiedWorking in a different or more specialized technical role than the one I'm in nowI am actively looking for a jobMore than 4 years ago1.07.010.08.02.05.04.03.06.09.01.05.03.07.010.04.011.09.06.02.08.03.01.05.02.04.01.03.04.05.02.06.07.0I saw an employer’s advertisementBritish pounds sterling (£)51000Yearly70841.0GBPConfluence;Office / productivity suite (Microsoft Office, Google Suite, etc.);Slack;Other wiki tool (Github, Google Sites, proprietary software, etc.)One to three monthsTaught yourself a new language, framework, or tool without taking a formal course;Contributed to open source softwareThe official documentation and/or standards for the technology;Questions & answers on Stack OverflowNaNNaNAgreeAgreeNeither Agree nor DisagreeJavaScript;Python;Bash/ShellGo;PythonRedis;PostgreSQL;MemcachedPostgreSQLLinuxLinuxDjangoReactIPython / Jupyter;Sublime Text;VimLinux-based2NaNGit;SubversionA few times per weekYesYesThe website I was visiting asked me to disable itSomewhat agreeNeither agree nor disagreeNeither agree nor disagreeNaN3.05.01.04.06.07.02.0Increasing automation of jobsIncreasing automation of jobsThe developers or the people creating the AII'm excited about the possibilities more than worried about the dangers.Depends on what it isDepends on what it isUpper management at the company/organizationYes10 (Very Likely)A few times per month or weeklyYesA few times per month or weeklyYesNo, I have one but it's out of date7YesA little bit interestedA little bit interestedA little bit interestedA little bit interestedA little bit interestedBetween 6:01 - 7:00 AM5 - 8 hours30 - 59 minutesNeverErgonomic keyboard or mouseDaily or almost every 
dayMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)White or of European descent35 - 44 years oldYesNaNThe survey was an appropriate lengthSomewhat easy
\n", + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student Employment \\\n", + "0 1 Yes No Kenya No Employed part-time \n", + "1 3 Yes Yes United Kingdom No Employed full-time \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "\n", + " YearsCoding YearsCodingProf JobSatisfaction \\\n", + "0 3-5 years 3-5 years Extremely satisfied \n", + "1 30 or more years 18-20 years Moderately dissatisfied \n", + "\n", + " CareerSatisfaction \\\n", + "0 Extremely satisfied \n", + "1 Neither satisfied nor dissatisfied \n", + "\n", + " HopeFiveYears \\\n", + "0 Working as a founder or co-founder of my own company \n", + "1 Working in a different or more specialized technical role than the one I'm in now \n", + "\n", + " JobSearchStatus \\\n", + "0 I’m not actively looking, but I am open to new opportunities \n", + "1 I am actively looking for a job \n", + "\n", + " LastNewJob AssessJob1 AssessJob2 AssessJob3 AssessJob4 \\\n", + "0 Less than a year ago 10.0 7.0 8.0 1.0 \n", + "1 More than 4 years ago 1.0 7.0 10.0 8.0 \n", + "\n", + " AssessJob5 AssessJob6 AssessJob7 AssessJob8 AssessJob9 AssessJob10 \\\n", + "0 2.0 5.0 3.0 4.0 9.0 6.0 \n", + "1 2.0 5.0 4.0 3.0 6.0 9.0 \n", + "\n", + " AssessBenefits1 AssessBenefits2 AssessBenefits3 AssessBenefits4 \\\n", + "0 NaN NaN NaN NaN \n", + "1 1.0 5.0 3.0 7.0 \n", + "\n", + " AssessBenefits5 AssessBenefits6 AssessBenefits7 AssessBenefits8 \\\n", + "0 NaN NaN NaN NaN \n", + "1 10.0 4.0 11.0 9.0 \n", + "\n", + " AssessBenefits9 AssessBenefits10 AssessBenefits11 
JobContactPriorities1 \\\n", + "0 NaN NaN NaN 3.0 \n", + "1 6.0 2.0 8.0 3.0 \n", + "\n", + " JobContactPriorities2 JobContactPriorities3 JobContactPriorities4 \\\n", + "0 1.0 4.0 2.0 \n", + "1 1.0 5.0 2.0 \n", + "\n", + " JobContactPriorities5 JobEmailPriorities1 JobEmailPriorities2 \\\n", + "0 5.0 5.0 6.0 \n", + "1 4.0 1.0 3.0 \n", + "\n", + " JobEmailPriorities3 JobEmailPriorities4 JobEmailPriorities5 \\\n", + "0 7.0 2.0 1.0 \n", + "1 4.0 5.0 2.0 \n", + "\n", + " JobEmailPriorities6 JobEmailPriorities7 \\\n", + "0 4.0 3.0 \n", + "1 6.0 7.0 \n", + "\n", + " UpdateCV \\\n", + "0 My job status or other personal status changed \n", + "1 I saw an employer’s advertisement \n", + "\n", + " Currency Salary SalaryType ConvertedSalary \\\n", + "0 NaN NaN Monthly NaN \n", + "1 British pounds sterling (£) 51000 Yearly 70841.0 \n", + "\n", + " CurrencySymbol \\\n", + "0 KES \n", + "1 GBP \n", + "\n", + " CommunicationTools \\\n", + "0 Slack \n", + "1 Confluence;Office / productivity suite (Microsoft Office, Google Suite, etc.);Slack;Other wiki tool (Github, Google Sites, proprietary software, etc.) \n", + "\n", + " TimeFullyProductive \\\n", + "0 One to three months \n", + "1 One to three months \n", + "\n", + " EducationTypes \\\n", + "0 Taught yourself a new language, framework, or tool without taking a formal course;Participated in a hackathon \n", + "1 Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software \n", + "\n", + " SelfTaughtTypes \\\n", + "0 The official documentation and/or standards for the technology;A book or e-book from O’Reilly, Apress, or a similar publisher;Questions & answers on Stack Overflow;Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.) 
\n", + "1 The official documentation and/or standards for the technology;Questions & answers on Stack Overflow \n", + "\n", + " TimeAfterBootcamp HackathonReasons AgreeDisagree1 \\\n", + "0 NaN To build my professional network Strongly agree \n", + "1 NaN NaN Agree \n", + "\n", + " AgreeDisagree2 AgreeDisagree3 LanguageWorkedWith \\\n", + "0 Strongly agree Neither Agree nor Disagree JavaScript;Python;HTML;CSS \n", + "1 Agree Neither Agree nor Disagree JavaScript;Python;Bash/Shell \n", + "\n", + " LanguageDesireNextYear \\\n", + "0 JavaScript;Python;HTML;CSS \n", + "1 Go;Python \n", + "\n", + " DatabaseWorkedWith \\\n", + "0 Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc) \n", + "1 Redis;PostgreSQL;Memcached \n", + "\n", + " DatabaseDesireNextYear \\\n", + "0 Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc) \n", + "1 PostgreSQL \n", + "\n", + " PlatformWorkedWith PlatformDesireNextYear FrameworkWorkedWith \\\n", + "0 AWS;Azure;Linux;Firebase AWS;Azure;Linux;Firebase Django;React \n", + "1 Linux Linux Django \n", + "\n", + " FrameworkDesireNextYear IDE OperatingSystem \\\n", + "0 Django;React Komodo;Vim;Visual Studio Code Linux-based \n", + "1 React IPython / Jupyter;Sublime Text;Vim Linux-based \n", + "\n", + " NumberMonitors Methodology VersionControl CheckInCode \\\n", + "0 1 Agile;Scrum Git Multiple times per day \n", + "1 2 NaN Git;Subversion A few times per week \n", + "\n", + " AdBlocker AdBlockerDisable \\\n", + "0 Yes No \n", + "1 Yes Yes \n", + "\n", + " AdBlockerReasons AdsAgreeDisagree1 \\\n", + "0 NaN Strongly agree \n", + "1 The website I was visiting asked me to disable it Somewhat agree \n", + "\n", + " AdsAgreeDisagree2 AdsAgreeDisagree3 \\\n", + "0 Strongly agree Strongly agree \n", + "1 Neither agree nor disagree Neither agree nor disagree \n", + "\n", + " AdsActions \\\n", + "0 Saw an online advertisement and then researched it (without clicking on the 
ad);Stopped going to a website because of their advertising \n", + "1 NaN \n", + "\n", + " AdsPriorities1 AdsPriorities2 AdsPriorities3 AdsPriorities4 \\\n", + "0 1.0 5.0 4.0 7.0 \n", + "1 3.0 5.0 1.0 4.0 \n", + "\n", + " AdsPriorities5 AdsPriorities6 AdsPriorities7 \\\n", + "0 2.0 6.0 3.0 \n", + "1 6.0 7.0 2.0 \n", + "\n", + " AIDangerous \\\n", + "0 Artificial intelligence surpassing human intelligence (\"the singularity\") \n", + "1 Increasing automation of jobs \n", + "\n", + " AIInteresting \\\n", + "0 Algorithms making important decisions \n", + "1 Increasing automation of jobs \n", + "\n", + " AIResponsible \\\n", + "0 The developers or the people creating the AI \n", + "1 The developers or the people creating the AI \n", + "\n", + " AIFuture \\\n", + "0 I'm excited about the possibilities more than worried about the dangers. \n", + "1 I'm excited about the possibilities more than worried about the dangers. \n", + "\n", + " EthicsChoice EthicsReport \\\n", + "0 No Yes, and publicly \n", + "1 Depends on what it is Depends on what it is \n", + "\n", + " EthicsResponsible EthicalImplications \\\n", + "0 Upper management at the company/organization Yes \n", + "1 Upper management at the company/organization Yes \n", + "\n", + " StackOverflowRecommend StackOverflowVisit \\\n", + "0 10 (Very Likely) Multiple times per day \n", + "1 10 (Very Likely) A few times per month or weekly \n", + "\n", + " StackOverflowHasAccount StackOverflowParticipate \\\n", + "0 Yes I have never participated in Q&A on Stack Overflow \n", + "1 Yes A few times per month or weekly \n", + "\n", + " StackOverflowJobs \\\n", + "0 No, I knew that Stack Overflow had a jobs board but have never used or visited it \n", + "1 Yes \n", + "\n", + " StackOverflowDevStory StackOverflowJobsRecommend \\\n", + "0 Yes NaN \n", + "1 No, I have one but it's out of date 7 \n", + "\n", + " StackOverflowConsiderMember HypotheticalTools1 \\\n", + "0 Yes Extremely interested \n", + "1 Yes A little bit interested 
\n", + "\n", + " HypotheticalTools2 HypotheticalTools3 HypotheticalTools4 \\\n", + "0 Extremely interested Extremely interested Extremely interested \n", + "1 A little bit interested A little bit interested A little bit interested \n", + "\n", + " HypotheticalTools5 WakeTime HoursComputer \\\n", + "0 Extremely interested Between 5:00 - 6:00 AM 9 - 12 hours \n", + "1 A little bit interested Between 6:01 - 7:00 AM 5 - 8 hours \n", + "\n", + " HoursOutside SkipMeals ErgonomicDevices \\\n", + "0 1 - 2 hours Never Standing desk \n", + "1 30 - 59 minutes Never Ergonomic keyboard or mouse \n", + "\n", + " Exercise Gender SexualOrientation \\\n", + "0 3 - 4 times per week Male Straight or heterosexual \n", + "1 Daily or almost every day Male Straight or heterosexual \n", + "\n", + " EducationParents RaceEthnicity \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) White or of European descent \n", + "\n", + " Age Dependents MilitaryUS \\\n", + "0 25 - 34 years old Yes NaN \n", + "1 35 - 44 years old Yes NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['No', 'Yes'], dtype=object)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.OpenSource.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['No', 'Yes, part-time', nan, 'Yes, full-time'], dtype=object)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"df.Student.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "No 70399\n", + "Yes, full-time 18394\n", + "Yes, part-time 6108\n", + "NaN 3954\n", + "Name: Student, dtype: int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.Student.value_counts(dropna=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Optimize datatypes\n", + "\n", + "* downcast numbers when full precision is not needed\n", + "* use categories" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.75 MB\n", + "0.38 MB\n" + ] + } + ], + "source": [ + "# optimize storage for dataframe with numbers\n", + "\n", + "gl_int = df.select_dtypes(include=[\"int\"])\n", + "converted_int = gl_int.apply(pd.to_numeric, downcast=\"unsigned\")\n", + "\n", + "\n", + "def mem_usage(pandas_obj):\n", + " if isinstance(pandas_obj, pd.DataFrame):\n", + " usage_b = pandas_obj.memory_usage(deep=True).sum()\n", + " else: # we assume if not a df it's a series\n", + " usage_b = pandas_obj.memory_usage(deep=True)\n", + " usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes\n", + " return \"{:03.2f} MB\".format(usage_mb)\n", + "\n", + "print(mem_usage(gl_int))\n", + "print(mem_usage(converted_int))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "622.07 MB\n", + "572.40 MB\n" + ] + } + ], + "source": [ + "# convert columns to categorical\n", + "\n", + "def as_categorical(df):\n", + " df['CompanySize'] = df.CompanySize.astype('category')\n", + " df['Country'] = df.Country.astype('category')\n", + " df['Hobby'] = df.Hobby.astype('category')\n", + " df['YearsCoding'] = df.YearsCoding.astype('category')\n", + " df['Employment'] = 
df.Employment.astype('category')\n", + " df['LastNewJob'] = df.LastNewJob.astype('category')\n", + " df['JobSatisfaction'] = df.JobSatisfaction.astype('category')\n", + " df['CareerSatisfaction'] = df.CareerSatisfaction.astype('category') \n", + "\n", + "print(mem_usage(df))\n", + "as_categorical(df)\n", + "print(mem_usage(df))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "572.40 MB\n", + "566.89 MB\n" + ] + } + ], + "source": [ + "print(mem_usage(df))\n", + "df['OpenSource'] = df.OpenSource.astype('category')\n", + "print(mem_usage(df))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Use built-in functions" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 5765285 function calls (5689280 primitive calls) in 3.477 seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 7999 0.005 0.000 0.008 0.000 :416(parent)\n", + " 192010 0.093 0.000 0.210 0.000 :997(_handle_fromlist)\n", + " 1 0.000 0.000 3.446 3.446 :5(before)\n", + " 1 0.019 0.019 3.445 3.445 :6()\n", + " 1 0.030 0.030 3.477 3.477 :1()\n", + " 4001 0.003 0.000 0.006 0.000 __init__.py:205(iteritems)\n", + " 1 0.000 0.000 0.012 0.012 _decorators.py:136(wrapper)\n", + " 9 0.000 0.000 0.001 0.000 _methods.py:26(_amax)\n", + " 9 0.000 0.000 0.001 0.000 _methods.py:30(_amin)\n", + " 8003 0.004 0.000 0.042 0.000 _methods.py:42(_any)\n", + " 1 0.000 0.000 0.000 0.000 _methods.py:45(_all)\n", + " 1 0.000 0.000 0.000 0.000 _validators.py:114(_check_for_invalid_keys)\n", + " 1 0.000 0.000 0.000 0.000 _validators.py:130(validate_kwargs)\n", + " 1 0.000 0.000 0.000 0.000 _validators.py:32(_check_for_default_values)\n", + " 10 0.000 0.000 0.000 0.000 _weakrefset.py:70(__contains__)\n", 
+ " 10 0.000 0.000 0.000 0.000 abc.py:180(__instancecheck__)\n", + " 1 0.000 0.000 0.000 0.000 algorithms.py:141(_reconstruct_data)\n", + " 14 0.000 0.000 0.000 0.000 algorithms.py:1421(_get_take_nd_function)\n", + " 9 0.000 0.000 0.003 0.000 algorithms.py:1454(take)\n", + " 14 0.000 0.000 0.098 0.007 algorithms.py:1548(take_nd)\n", + " 1 0.000 0.000 0.000 0.000 algorithms.py:172(_ensure_arraylike)\n", + " 1 0.000 0.000 0.002 0.002 algorithms.py:224(_get_data_algo)\n", + " 1 0.000 0.000 0.008 0.008 algorithms.py:449(_factorize_array)\n", + " 2 0.000 0.000 0.000 0.000 algorithms.py:48(_ensure_data)\n", + " 1 0.000 0.000 0.012 0.012 algorithms.py:576(factorize)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1329(_get_names)\n", + " 3998 0.008 0.000 0.015 0.000 base.py:1695(_convert_slice_indexer)\n", + " 3 0.000 0.000 0.000 0.000 base.py:2033(__contains__)\n", + " 3998 0.010 0.000 0.221 0.000 base.py:2067(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2179(take)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2445(equals)\n", + " 4000 0.042 0.000 0.174 0.000 base.py:255(__new__)\n", + " 11994 0.005 0.000 0.006 0.000 base.py:4132(_validate_indexer)\n", + " 4001 0.009 0.000 0.019 0.000 base.py:473(_simple_new)\n", + " 8000 0.003 0.000 0.004 0.000 base.py:4914(_ensure_index)\n", + " 3999 0.012 0.000 0.194 0.000 base.py:520(_shallow_copy_with_infer)\n", + " 84052 0.051 0.000 0.093 0.000 base.py:61(is_dtype)\n", + " 1 0.000 0.000 0.000 0.000 base.py:615(is_)\n", + " 4001 0.002 0.000 0.002 0.000 base.py:635(_reset_identity)\n", + " 71988 0.023 0.000 0.032 0.000 base.py:641(__len__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:662(dtype)\n", + " 6 0.000 0.000 0.000 0.000 base.py:672(values)\n", + " 3 0.000 0.000 0.000 0.000 base.py:677(_values)\n", + " 2 0.000 0.000 0.000 0.000 base.py:711(get_values)\n", + " 1 0.000 0.000 0.000 0.000 base.py:893(tolist)\n", + " 3999 0.002 0.000 0.003 0.000 base.py:904(_coerce_to_ndarray)\n", + " 1 0.000 0.000 0.000 0.000 
base.py:912(__iter__)\n", + " 3999 0.003 0.000 0.006 0.000 base.py:920(_get_attributes_dict)\n", + " 3999 0.002 0.000 0.003 0.000 base.py:922()\n", + " 13 0.000 0.000 0.000 0.000 cast.py:257(maybe_promote)\n", + " 35991 0.020 0.000 0.060 0.000 cast.py:600(coerce_indexer_dtype)\n", + " 35991 0.022 0.000 0.082 0.000 categorical.py:147(_maybe_to_categorical)\n", + " 9 0.000 0.000 0.004 0.000 categorical.py:1774(take_nd)\n", + " 18 0.000 0.000 0.000 0.000 categorical.py:1841(__len__)\n", + " 35982 0.111 0.000 1.120 0.000 categorical.py:1943(__getitem__)\n", + " 35991 0.073 0.000 0.932 0.000 categorical.py:267(__init__)\n", + " 35991 0.023 0.000 0.036 0.000 categorical.py:381(categories)\n", + " 35991 0.017 0.000 0.027 0.000 categorical.py:420(ordered)\n", + " 179982 0.026 0.000 0.026 0.000 categorical.py:425(dtype)\n", + " 35991 0.006 0.000 0.006 0.000 categorical.py:434(_constructor)\n", + " 4001 0.003 0.000 0.021 0.000 common.py:1043(is_datetime64_any_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:1170(is_datetime_or_timedelta_dtype)\n", + " 14 0.000 0.000 0.000 0.000 common.py:122(is_sparse)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1294(is_datetimelike_v_numeric)\n", + " 3 0.000 0.000 0.000 0.000 common.py:1405(needs_i8_conversion)\n", + " 6 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)\n", + " 6 0.000 0.000 0.000 0.000 common.py:1578(is_bool_dtype)\n", + " 43 0.000 0.000 0.000 0.000 common.py:1688(is_extension_array_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1717(is_complex_dtype)\n", + " 8006 0.006 0.000 0.008 0.000 common.py:1784(_get_dtype)\n", + "12027/12026 0.012 0.000 0.019 0.000 common.py:1835(_get_dtype_type)\n", + " 35991 0.020 0.000 0.122 0.000 common.py:195(is_categorical)\n", + " 41 0.000 0.000 0.000 0.000 common.py:227(is_datetimetz)\n", + " 1 0.000 0.000 0.000 0.000 common.py:301(_asarray_tuplesafe)\n", + " 4003 0.004 0.000 0.013 0.000 common.py:332(is_datetime64_dtype)\n", + " 35982 0.019 0.000 0.023 0.000 
common.py:359(is_null_slice)\n", + " 4048 0.002 0.000 0.005 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 3999 0.002 0.000 0.003 0.000 common.py:395(_apply_if_callable)\n", + " 4003 0.003 0.000 0.011 0.000 common.py:407(is_timedelta64_dtype)\n", + " 6 0.000 0.000 0.000 0.000 common.py:444(is_period_dtype)\n", + " 8015 0.003 0.000 0.012 0.000 common.py:477(is_interval_dtype)\n", + " 43995 0.020 0.000 0.054 0.000 common.py:513(is_categorical_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:546(is_string_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:647(is_datetimelike)\n", + " 4002 0.003 0.000 0.011 0.000 common.py:692(is_dtype_equal)\n", + " 4004 0.005 0.000 0.009 0.000 common.py:858(is_signed_integer_dtype)\n", + " 5 0.000 0.000 0.000 0.000 common.py:89(is_object_dtype)\n", + " 5 0.000 0.000 0.000 0.000 common.py:907(is_unsigned_integer_dtype)\n", + " 71982 0.032 0.000 0.542 0.000 dtypes.py:137(__init__)\n", + " 71982 0.062 0.000 0.511 0.000 dtypes.py:156(_finalize)\n", + " 1 0.000 0.000 0.000 0.000 dtypes.py:266(construct_from_string)\n", + " 71982 0.064 0.000 0.148 0.000 dtypes.py:278(validate_ordered)\n", + " 71982 0.094 0.000 0.301 0.000 dtypes.py:298(validate_categories)\n", + " 36009 0.006 0.000 0.006 0.000 dtypes.py:30(__unicode__)\n", + " 36009 0.014 0.000 0.019 0.000 dtypes.py:33(__str__)\n", + " 35991 0.063 0.000 0.392 0.000 dtypes.py:331(update_dtype)\n", + " 71982 0.013 0.000 0.013 0.000 dtypes.py:363(categories)\n", + " 71982 0.010 0.000 0.010 0.000 dtypes.py:370(ordered)\n", + " 3 0.000 0.000 0.000 0.000 dtypes.py:401(__new__)\n", + " 3 0.000 0.000 0.000 0.000 dtypes.py:459(construct_from_string)\n", + " 6 0.000 0.000 0.000 0.000 dtypes.py:584(is_dtype)\n", + " 4015 0.005 0.000 0.009 0.000 dtypes.py:707(is_dtype)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:2664(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:2690(_getitem_column)\n", + " 3999 0.001 0.000 0.001 0.000 frame.py:320(_constructor)\n", + " 3999 0.013 0.000 0.026 0.000 
frame.py:334(__init__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:538(axes)\n", + " 3998 0.003 0.000 0.006 0.000 frame.py:844(__len__)\n", + " 3999 0.006 0.000 0.006 0.000 generic.py:124(__init__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:1264(_check_label_or_level_ambiguity)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:1294()\n", + " 1 0.000 0.000 0.000 0.000 generic.py:1520(__contains__)\n", + " 3999 0.004 0.000 0.005 0.000 generic.py:178(_init_mgr)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2484(_get_item_cache)\n", + " 3998 0.019 0.000 3.216 0.001 generic.py:2583(_slice)\n", + " 3999 0.005 0.000 0.009 0.000 generic.py:2603(_set_is_copy)\n", + " 1 0.000 0.000 0.101 0.101 generic.py:2783(_take)\n", + " 4002 0.004 0.000 0.005 0.000 generic.py:364(_get_axis_number)\n", + " 4001 0.004 0.000 0.007 0.000 generic.py:377(_get_axis_name)\n", + " 4001 0.003 0.000 0.011 0.000 generic.py:390(_get_axis)\n", + " 3999 0.003 0.000 0.008 0.000 generic.py:394(_get_block_manager_axis)\n", + " 3999 0.003 0.000 0.004 0.000 generic.py:4345(__finalize__)\n", + " 4001 0.004 0.000 0.004 0.000 generic.py:4378(__setattr__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:438(_info_axis)\n", + " 2 0.000 0.000 0.001 0.001 generic.py:4423(_protect_consolidate)\n", + " 2 0.000 0.000 0.001 0.001 generic.py:4433(_consolidate_inplace)\n", + " 2 0.000 0.000 0.001 0.001 generic.py:4436(f)\n", + " 3999 0.002 0.000 0.004 0.000 generic.py:458(ndim)\n", + " 1 0.000 0.000 0.001 0.001 generic.py:6592(groupby)\n", + " 324113 0.097 0.000 0.163 0.000 generic.py:7(_check)\n", + " 1 0.000 0.000 0.001 0.001 groupby.py:2143(groupby)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:2196(__init__)\n", + " 3999 0.003 0.000 3.418 0.001 groupby.py:2217(get_iterator)\n", + " 1 0.000 0.000 0.012 0.012 groupby.py:2231(_get_splitter)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:2235(_get_group_keys)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:2295(levels)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:2297()\n", + " 1 
0.000 0.000 0.012 0.012 groupby.py:2333(group_info)\n", + " 1 0.000 0.000 0.012 0.012 groupby.py:2350(_get_compressed_labels)\n", + " 1 0.000 0.000 0.012 0.012 groupby.py:2351()\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:2939(__init__)\n", + " 2 0.000 0.000 0.012 0.006 groupby.py:3067(labels)\n", + " 2 0.000 0.000 0.000 0.000 groupby.py:3089(group_index)\n", + " 1 0.000 0.000 0.012 0.012 groupby.py:3095(_make_labels)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:3114(_get_grouper)\n", + " 2 0.000 0.000 0.000 0.000 groupby.py:3228()\n", + " 2 0.000 0.000 0.000 0.000 groupby.py:3229()\n", + " 2 0.000 0.000 0.000 0.000 groupby.py:3230()\n", + " 2 0.000 0.000 0.000 0.000 groupby.py:3235()\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:3258(is_in_axis)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:3268(is_in_obj)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:3327(_is_label_like)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:3332(_convert_grouper)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:5021(__init__)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:5028(slabels)\n", + " 1 0.000 0.000 0.001 0.001 groupby.py:5033(sort_idx)\n", + " 3999 0.008 0.000 3.402 0.001 groupby.py:5038(__iter__)\n", + " 1 0.000 0.000 0.102 0.102 groupby.py:5057(_get_sorted_data)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:5075(__init__)\n", + " 3998 0.008 0.000 3.292 0.001 groupby.py:5092(_chop)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:5120(get_splitter)\n", + " 1 0.000 0.000 0.001 0.001 groupby.py:567(__init__)\n", + " 1 0.000 0.000 0.000 0.000 groupby.py:881(__iter__)\n", + " 3998 0.008 0.000 3.284 0.001 indexing.py:1463(__getitem__)\n", + " 3998 0.003 0.000 3.220 0.001 indexing.py:147(_slice)\n", + " 3998 0.007 0.000 3.270 0.001 indexing.py:2040(_get_slice_axis)\n", + " 3998 0.003 0.000 3.274 0.001 indexing.py:2075(_getitem_axis)\n", + " 1 0.000 0.000 0.000 0.000 indexing.py:2441(maybe_convert_indices)\n", + " 9 0.000 0.000 0.002 0.000 indexing.py:2484(validate_indices)\n", + " 3998 0.001 0.000 
0.001 0.000 indexing.py:2564(need_slice)\n", + " 3998 0.010 0.000 0.042 0.000 indexing.py:263(_convert_slice_indexer)\n", + " 10 0.000 0.000 0.000 0.000 inference.py:251(is_list_like)\n", + " 10 0.000 0.000 0.000 0.000 inference.py:287(is_array_like)\n", + " 1 0.000 0.000 0.000 0.000 inference.py:415(is_hashable)\n", + " 47988 0.052 0.000 0.108 0.000 internals.py:116(__init__)\n", + " 3 0.000 0.000 0.095 0.032 internals.py:1237(take_nd)\n", + " 47988 0.013 0.000 0.013 0.000 internals.py:127(_check_ndim)\n", + " 95976 0.034 0.000 0.070 0.000 internals.py:166(_consolidate_key)\n", + " 35991 0.041 0.000 0.112 0.000 internals.py:1723(__init__)\n", + " 18 0.000 0.000 0.000 0.000 internals.py:1745(shape)\n", + " 35991 0.037 0.000 0.217 0.000 internals.py:1864(__init__)\n", + " 35991 0.013 0.000 0.068 0.000 internals.py:1868(_maybe_coerce_values)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1891(fill_value)\n", + " 9 0.000 0.000 0.004 0.000 internals.py:1947(take_nd)\n", + " 35982 0.062 0.000 1.213 0.000 internals.py:1976(_slice)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:203(internal_values)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:229(fill_value)\n", + " 3999 0.006 0.000 0.024 0.000 internals.py:2298(__init__)\n", + " 156015 0.026 0.000 0.026 0.000 internals.py:233(mgr_locs)\n", + " 47988 0.018 0.000 0.024 0.000 internals.py:237(mgr_locs)\n", + " 35991 0.066 0.000 0.403 0.000 internals.py:2552(__init__)\n", + " 47988 0.033 0.000 0.535 0.000 internals.py:269(make_block_same_class)\n", + " 11994 0.009 0.000 0.009 0.000 internals.py:310(_slice)\n", + " 47988 0.043 0.000 0.502 0.000 internals.py:3191(make_block)\n", + " 4000 0.026 0.000 0.488 0.000 internals.py:3265(__init__)\n", + " 4000 0.004 0.000 0.008 0.000 internals.py:3266()\n", + " 47976 0.135 0.000 1.962 0.000 internals.py:328(getitem_block)\n", + " 16001 0.065 0.000 0.100 0.000 internals.py:3307(shape)\n", + " 48003 0.011 0.000 0.035 0.000 internals.py:3309()\n", + " 55998 0.013 0.000 0.017 0.000 
internals.py:3311(ndim)\n", + " 7999 0.250 0.000 0.530 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:348(shape)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3488(_verify_integrity)\n", + " 26 0.000 0.000 0.000 0.000 internals.py:3490()\n", + " 143992 0.043 0.000 0.059 0.000 internals.py:352(dtype)\n", + " 48012 0.026 0.000 0.116 0.000 internals.py:356(ftype)\n", + " 4003 0.001 0.000 0.001 0.000 internals.py:3776(is_consolidated)\n", + " 4001 0.010 0.000 0.141 0.000 internals.py:3784(_consolidate_check)\n", + " 4001 0.013 0.000 0.130 0.000 internals.py:3785()\n", + " 3998 0.025 0.000 3.150 0.001 internals.py:3869(get_slice)\n", + " 3998 0.020 0.000 1.982 0.000 internals.py:3879()\n", + " 2 0.000 0.000 0.001 0.001 internals.py:4085(consolidate)\n", + " 4001 0.010 0.000 0.433 0.000 internals.py:4101(_consolidate_inplace)\n", + " 1 0.000 0.000 0.100 0.100 internals.py:4388(reindex_indexer)\n", + " 1 0.000 0.000 0.100 0.100 internals.py:4423()\n", + " 1 0.000 0.000 0.101 0.101 internals.py:4518(take)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4684(_block)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4718(dtype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4745(internal_values)\n", + " 3999 0.037 0.000 0.192 0.000 internals.py:5057(_consolidate)\n", + " 95976 0.025 0.000 0.095 0.000 internals.py:5063()\n", + " 15996 0.005 0.000 0.007 0.000 internals.py:5074(_merge_blocks)\n", + " 15996 0.022 0.000 0.036 0.000 internals.py:5101(_extend_blocks)\n", + " 9 0.000 0.000 0.000 0.000 missing.py:112(_isna_new)\n", + " 9 0.000 0.000 0.000 0.000 missing.py:32(isna)\n", + " 1 0.000 0.000 0.000 0.000 missing.py:376(array_equivalent)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:2389(array_equal)\n", + " 4000 0.011 0.000 0.046 0.000 numeric.py:35(__new__)\n", + " 51 0.000 0.000 0.000 0.000 numeric.py:433(asarray)\n", + " 1 0.000 0.000 0.000 0.000 
numeric.py:504(asanyarray)\n", + " 7996 0.032 0.000 0.056 0.000 numeric.py:630(require)\n", + " 3999 0.007 0.000 0.202 0.000 numeric.py:64(_shallow_copy)\n", + " 15992 0.007 0.000 0.009 0.000 numeric.py:701()\n", + " 1 0.000 0.000 0.000 0.000 range.py:169(_data)\n", + " 1 0.000 0.000 0.000 0.000 range.py:173(_int64index)\n", + " 1 0.000 0.000 0.000 0.000 range.py:260(_shallow_copy)\n", + " 1 0.000 0.000 0.000 0.000 range.py:315(equals)\n", + " 8 0.000 0.000 0.000 0.000 range.py:481(__len__)\n", + " 1 0.000 0.000 0.000 0.000 series.py:412(dtype)\n", + " 1 0.000 0.000 0.000 0.000 series.py:465(_values)\n", + " 1 0.000 0.000 0.001 0.001 sorting.py:321(get_group_index_sorter)\n", + " 4001 0.001 0.000 0.001 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.all}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method builtins.any}\n", + " 4000 0.001 0.000 0.001 0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 3.477 3.477 {built-in method builtins.exec}\n", + " 416168 0.096 0.000 0.096 0.000 {built-in method builtins.getattr}\n", + " 276050 0.115 0.000 0.115 0.000 {built-in method builtins.hasattr}\n", + " 4 0.000 0.000 0.000 0.000 {built-in method builtins.hash}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.id}\n", + " 956308 0.220 0.000 0.383 0.000 {built-in method builtins.isinstance}\n", + " 24077 0.005 0.000 0.005 0.000 {built-in method builtins.issubclass}\n", + " 4002 0.002 0.000 0.002 0.000 {built-in method builtins.iter}\n", + "424055/348051 0.085 0.000 0.111 0.000 {built-in method builtins.len}\n", + " 8 0.000 0.000 0.000 0.000 {built-in method builtins.max}\n", + " 3998 0.002 0.000 0.002 0.000 {built-in method builtins.min}\n", + " 3999 0.017 0.000 0.068 0.000 {built-in method builtins.sorted}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method builtins.sum}\n", + " 95990 0.067 0.000 0.067 0.000 {built-in method numpy.core.multiarray.arange}\n", + " 8048 0.016 0.000 
0.016 0.000 {built-in method numpy.core.multiarray.array}\n", + " 16012 0.047 0.000 0.047 0.000 {built-in method numpy.core.multiarray.empty}\n", + " 3999 0.001 0.000 0.001 0.000 {built-in method pandas._libs.algos.ensure_int16}\n", + " 17 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 31992 0.006 0.000 0.006 0.000 {built-in method pandas._libs.algos.ensure_int8}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_platform_int}\n", + " 71992 0.009 0.000 0.009 0.000 {built-in method pandas._libs.lib.is_bool}\n", + " 13 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_float}\n", + " 12008 0.002 0.000 0.002 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 4007 0.009 0.000 0.009 0.000 {built-in method pandas._libs.lib.is_scalar}\n", + " 9 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}\n", + " 1 0.000 0.000 0.000 0.000 {method 'all' of 'numpy.ndarray' objects}\n", + " 8003 0.005 0.000 0.047 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 47990 0.005 0.000 0.005 0.000 {method 'append' of 'list' objects}\n", + " 2 0.003 0.001 0.003 0.001 {method 'argsort' of 'numpy.ndarray' objects}\n", + " 11 0.000 0.000 0.000 0.000 {method 'astype' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 15998 0.010 0.000 0.010 0.000 {method 'fill' of 'numpy.ndarray' objects}\n", + " 48012 0.049 0.000 0.068 0.000 {method 'format' of 'str' objects}\n", + " 8019 0.002 0.000 0.002 0.000 {method 'get' of 'dict' objects}\n", + " 1 0.006 0.006 0.006 0.006 {method 'get_labels' of 'pandas._libs.hashtable.PyObjectHashTable' objects}\n", + " 8000 0.002 0.000 0.002 0.000 {method 'items' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'lower' of 'str' 
objects}\n", + " 9 0.000 0.000 0.001 0.000 {method 'max' of 'numpy.ndarray' objects}\n", + " 9 0.000 0.000 0.001 0.000 {method 'min' of 'numpy.ndarray' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}\n", + " 8022 0.040 0.000 0.040 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 7999 0.002 0.000 0.002 0.000 {method 'rpartition' of 'str' objects}\n", + " 3 0.000 0.000 0.000 0.000 {method 'search' of '_sre.SRE_Pattern' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}\n", + " 5 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'take' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'to_array' of 'pandas._libs.hashtable.ObjectVector' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'tolist' of 'numpy.ndarray' objects}\n", + " 3999 0.001 0.000 0.001 0.000 {method 'update' of 'dict' objects}\n", + " 7996 0.001 0.000 0.001 0.000 {method 'upper' of 'str' objects}\n", + " 6 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 1 0.001 0.001 0.001 0.001 {pandas._libs.algos.groupsort_indexer}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_int16_int16}\n", + " 2 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_int64_int64}\n", + " 8 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_int8_int8}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_2d_axis0_int64_int64}\n", + " 1 0.007 0.007 0.007 0.007 {pandas._libs.algos.take_2d_axis1_float64_float64}\n", + " 1 0.072 0.072 0.072 0.072 {pandas._libs.algos.take_2d_axis1_object_object}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.lib.generate_slices}\n", + " 2 0.002 0.001 0.002 0.001 {pandas._libs.lib.infer_dtype}\n", + " 2 0.000 0.000 0.000 0.000 {pandas._libs.lib.values_from_object}\n", + "\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 3003 function calls (2973 primitive calls) in 0.114 
seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 2 0.000 0.000 0.000 0.000 :416(parent)\n", + " 94 0.000 0.000 0.000 0.000 :997(_handle_fromlist)\n", + " 1 0.000 0.000 0.094 0.094 :9(after)\n", + " 1 0.020 0.020 0.114 0.114 :1()\n", + " 1 0.000 0.000 0.000 0.000 __init__.py:205(iteritems)\n", + " 9 0.000 0.000 0.009 0.001 _methods.py:26(_amax)\n", + " 9 0.000 0.000 0.005 0.001 _methods.py:30(_amin)\n", + " 5 0.000 0.000 0.000 0.000 _methods.py:42(_any)\n", + " 10 0.000 0.000 0.000 0.000 _weakrefset.py:70(__contains__)\n", + " 10 0.000 0.000 0.000 0.000 abc.py:180(__instancecheck__)\n", + " 12 0.000 0.000 0.000 0.000 algorithms.py:1421(_get_take_nd_function)\n", + " 9 0.000 0.000 0.017 0.002 algorithms.py:1454(take)\n", + " 12 0.001 0.000 0.073 0.006 algorithms.py:1548(take_nd)\n", + " 1 0.000 0.000 0.000 0.000 algorithms.py:48(_ensure_data)\n", + " 1 0.000 0.000 0.003 0.003 algorithms.py:774(duplicated)\n", + " 1 0.000 0.000 0.003 0.003 base.py:1245(duplicated)\n", + " 2 0.000 0.000 0.000 0.000 base.py:2033(__contains__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2179(take)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2445(equals)\n", + " 1 0.000 0.000 0.000 0.000 base.py:255(__new__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:473(_simple_new)\n", + " 3 0.000 0.000 0.000 0.000 base.py:4914(_ensure_index)\n", + " 1 0.000 0.000 0.000 0.000 base.py:520(_shallow_copy_with_infer)\n", + " 70 0.000 0.000 0.000 0.000 base.py:61(is_dtype)\n", + " 1 0.000 0.000 0.000 0.000 base.py:615(is_)\n", + " 1 0.000 0.000 0.000 0.000 base.py:635(_reset_identity)\n", + " 17 0.000 0.000 0.000 0.000 base.py:641(__len__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:662(dtype)\n", + " 3 0.000 0.000 0.000 0.000 base.py:672(values)\n", + " 2 0.000 0.000 0.000 0.000 base.py:711(get_values)\n", + " 1 0.000 0.000 0.000 0.000 base.py:904(_coerce_to_ndarray)\n", + " 1 0.000 0.000 0.000 0.000 
base.py:920(_get_attributes_dict)\n", + " 1 0.000 0.000 0.000 0.000 base.py:922()\n", + " 12 0.000 0.000 0.001 0.000 cast.py:257(maybe_promote)\n", + " 9 0.000 0.000 0.000 0.000 cast.py:600(coerce_indexer_dtype)\n", + " 1 0.000 0.000 0.000 0.000 cast.py:853(maybe_castable)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:147(_maybe_to_categorical)\n", + " 9 0.000 0.000 0.018 0.002 categorical.py:1774(take_nd)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:1841(__len__)\n", + " 9 0.000 0.000 0.001 0.000 categorical.py:267(__init__)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:381(categories)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:420(ordered)\n", + " 36 0.000 0.000 0.000 0.000 categorical.py:425(dtype)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:434(_constructor)\n", + " 1 0.000 0.000 0.000 0.000 common.py:100(is_bool_indexer)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1043(is_datetime64_any_dtype)\n", + " 14 0.000 0.000 0.000 0.000 common.py:122(is_sparse)\n", + " 2 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:1578(is_bool_dtype)\n", + " 36 0.000 0.000 0.000 0.000 common.py:1688(is_extension_array_dtype)\n", + " 8 0.000 0.000 0.000 0.000 common.py:1784(_get_dtype)\n", + " 9 0.000 0.000 0.000 0.000 common.py:1835(_get_dtype_type)\n", + " 9 0.000 0.000 0.000 0.000 common.py:195(is_categorical)\n", + " 37 0.000 0.000 0.000 0.000 common.py:227(is_datetimetz)\n", + " 1 0.000 0.000 0.000 0.000 common.py:332(is_datetime64_dtype)\n", + " 38 0.000 0.000 0.000 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:395(_apply_if_callable)\n", + " 1 0.000 0.000 0.000 0.000 common.py:407(is_timedelta64_dtype)\n", + " 14 0.000 0.000 0.000 0.000 common.py:477(is_interval_dtype)\n", + " 11 0.000 0.000 0.000 0.000 common.py:513(is_categorical_dtype)\n", + " 4 0.000 0.000 0.000 0.000 common.py:692(is_dtype_equal)\n", + " 3 0.000 0.000 0.000 0.000 
common.py:858(is_signed_integer_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:89(is_object_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:907(is_unsigned_integer_dtype)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:137(__init__)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:156(_finalize)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:278(validate_ordered)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:298(validate_categories)\n", + " 9 0.000 0.000 0.000 0.000 dtypes.py:30(__unicode__)\n", + " 9 0.000 0.000 0.000 0.000 dtypes.py:33(__str__)\n", + " 9 0.000 0.000 0.000 0.000 dtypes.py:331(update_dtype)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:363(categories)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:370(ordered)\n", + " 1 0.000 0.000 0.000 0.000 dtypes.py:401(__new__)\n", + " 1 0.000 0.000 0.000 0.000 dtypes.py:459(construct_from_string)\n", + " 13 0.000 0.000 0.000 0.000 dtypes.py:707(is_dtype)\n", + " 2 0.000 0.000 0.091 0.045 frame.py:2664(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:2690(_getitem_column)\n", + " 1 0.000 0.000 0.091 0.091 frame.py:2707(_getitem_array)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:320(_constructor)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:334(__init__)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:124(__init__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:1490(__hash__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:178(_init_mgr)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2484(_get_item_cache)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2603(_set_is_copy)\n", + " 1 0.000 0.000 0.090 0.090 generic.py:2783(_take)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:364(_get_axis_number)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:377(_get_axis_name)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:390(_get_axis)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:394(_get_block_manager_axis)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:4345(__finalize__)\n", + " 5 0.000 0.000 0.000 0.000 generic.py:4362(__getattr__)\n", + " 3 
0.000 0.000 0.000 0.000 generic.py:4378(__setattr__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4423(_protect_consolidate)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4433(_consolidate_inplace)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4436(f)\n", + " 246 0.000 0.000 0.000 0.000 generic.py:7(_check)\n", + " 1 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)\n", + " 1 0.000 0.000 0.000 0.000 indexing.py:2345(check_bool_indexer)\n", + " 1 0.000 0.000 0.001 0.001 indexing.py:2441(maybe_convert_indices)\n", + " 9 0.000 0.000 0.014 0.002 indexing.py:2484(validate_indices)\n", + " 10 0.000 0.000 0.000 0.000 inference.py:251(is_list_like)\n", + " 9 0.000 0.000 0.000 0.000 inference.py:287(is_array_like)\n", + " 1 0.000 0.000 0.000 0.000 inference.py:415(is_hashable)\n", + " 13 0.000 0.000 0.000 0.000 internals.py:116(__init__)\n", + " 3 0.000 0.000 0.070 0.023 internals.py:1237(take_nd)\n", + " 13 0.000 0.000 0.000 0.000 internals.py:127(_check_ndim)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1723(__init__)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1745(shape)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1864(__init__)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1868(_maybe_coerce_values)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:1891(fill_value)\n", + " 9 0.000 0.000 0.018 0.002 internals.py:1947(take_nd)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:222(to_dense)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:229(fill_value)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:2298(__init__)\n", + " 49 0.000 0.000 0.000 0.000 internals.py:233(mgr_locs)\n", + " 13 0.000 0.000 0.000 0.000 internals.py:237(mgr_locs)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:2552(__init__)\n", + " 12 0.000 0.000 0.000 0.000 internals.py:269(make_block_same_class)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3148(get_block_type)\n", + " 13 0.000 0.000 0.000 0.000 internals.py:3191(make_block)\n", + " 1 0.000 0.000 0.000 0.000 
internals.py:3265(__init__)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3266()\n", + " 4 0.000 0.000 0.000 0.000 internals.py:3307(shape)\n", + " 12 0.000 0.000 0.000 0.000 internals.py:3309()\n", + " 13 0.000 0.000 0.000 0.000 internals.py:3311(ndim)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3473(__len__)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:348(shape)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3488(_verify_integrity)\n", + " 13 0.000 0.000 0.000 0.000 internals.py:3490()\n", + " 22 0.000 0.000 0.000 0.000 internals.py:352(dtype)\n", + " 12 0.000 0.000 0.000 0.000 internals.py:356(ftype)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3784(_consolidate_check)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3785()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4085(consolidate)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4101(_consolidate_inplace)\n", + " 1 0.000 0.000 0.089 0.089 internals.py:4388(reindex_indexer)\n", + " 1 0.000 0.000 0.089 0.089 internals.py:4423()\n", + " 1 0.000 0.000 0.090 0.090 internals.py:4518(take)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4639(__init__)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:4684(_block)\n", + " 7 0.000 0.000 0.000 0.000 internals.py:4718(dtype)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4752(get_values)\n", + " 9 0.000 0.000 0.000 0.000 missing.py:112(_isna_new)\n", + " 9 0.000 0.000 0.000 0.000 missing.py:32(isna)\n", + " 1 0.000 0.000 0.000 0.000 missing.py:376(array_equivalent)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:35(__new__)\n", + " 43 0.000 0.000 0.000 0.000 numeric.py:433(asarray)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray)\n", + " 1 0.000 0.000 0.000 0.000 
numeric.py:64(_shallow_copy)\n", + " 1 0.000 0.000 0.000 0.000 range.py:260(_shallow_copy)\n", + " 2 0.000 0.000 0.000 0.000 range.py:315(equals)\n", + " 10 0.000 0.000 0.000 0.000 range.py:481(__len__)\n", + " 1 0.000 0.000 0.003 0.003 series.py:1577(duplicated)\n", + " 1 0.000 0.000 0.000 0.000 series.py:166(__init__)\n", + " 1 0.000 0.000 0.000 0.000 series.py:349(_constructor)\n", + " 1 0.000 0.000 0.000 0.000 series.py:365(_set_axis)\n", + " 1 0.000 0.000 0.000 0.000 series.py:391(_set_subtyp)\n", + " 2 0.000 0.000 0.000 0.000 series.py:401(name)\n", + " 1 0.000 0.000 0.000 0.000 series.py:4019(_sanitize_array)\n", + " 1 0.000 0.000 0.000 0.000 series.py:4036(_try_cast)\n", + " 2 0.000 0.000 0.000 0.000 series.py:405(name)\n", + " 7 0.000 0.000 0.000 0.000 series.py:412(dtype)\n", + " 2 0.000 0.000 0.000 0.000 series.py:476(get_values)\n", + " 1 0.000 0.000 0.000 0.000 series.py:562(__len__)\n", + " 2 0.000 0.000 0.000 0.000 series.py:637(__array__)\n", + " 1 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 0.114 0.114 {built-in method builtins.exec}\n", + " 323 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}\n", + " 160 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method builtins.hash}\n", + " 523 0.000 0.000 0.001 0.000 {built-in method builtins.isinstance}\n", + " 66 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.iter}\n", + " 164/136 0.000 0.000 0.000 0.000 {built-in method builtins.len}\n", + " 10 0.000 0.000 0.000 0.000 {built-in method builtins.max}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.sum}\n", + " 12 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.arange}\n", + " 46/44 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}\n", + " 14 0.015 
0.001 0.015 0.001 {built-in method numpy.core.multiarray.empty}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int16}\n", + " 12 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 8 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int8}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_platform_int}\n", + " 27 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_bool}\n", + " 12 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_float}\n", + " 10 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 9 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_scalar}\n", + " 9 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}\n", + " 5 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 9 0.000 0.000 0.000 0.000 {method 'astype' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}\n", + " 14 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}\n", + " 16 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}\n", + " 9 0.000 0.000 0.009 0.001 {method 'max' of 'numpy.ndarray' objects}\n", + " 9 0.000 0.000 0.005 0.001 {method 'min' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'nonzero' of 'numpy.ndarray' objects}\n", + " 23 0.014 0.001 0.014 0.001 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'search' of '_sre.SRE_Pattern' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 
'take' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}\n", + " 5 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_int16_int16}\n", + " 8 0.001 0.000 0.001 0.000 {pandas._libs.algos.take_1d_int8_int8}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_2d_axis0_int64_int64}\n", + " 1 0.007 0.007 0.007 0.007 {pandas._libs.algos.take_2d_axis1_float64_float64}\n", + " 1 0.047 0.047 0.047 0.047 {pandas._libs.algos.take_2d_axis1_object_object}\n", + " 1 0.003 0.003 0.003 0.003 {pandas._libs.hashtable.duplicated_object}\n", + " 2 0.000 0.000 0.000 0.000 {pandas._libs.lib.values_from_object}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "# optimize performance by using built-in functions\n", + "\n", + "import cProfile\n", + "\n", + "def before(df):\n", + " duplicated_data = [group for _, group in df.groupby('Salary') if len(group) > 1]\n", + "\n", + "\n", + "def after(df):\n", + " duplicated_data = df[df['Salary'].duplicated(keep=False)]\n", + "\n", + "\n", + "cProfile.run(\"before(df)\")\n", + "cProfile.run(\"after(df)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2.06 s ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit before(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "91.7 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit after(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Search for a smarter alternative" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 24422831 function calls (23928513 primitive calls) in 13.653 seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 6 0.000 0.000 0.000 0.000 :416(parent)\n", + " 296623 0.156 0.000 0.348 0.000 :997(_handle_fromlist)\n", + " 1 0.049 0.049 13.653 13.653 :3(before)\n", + " 98855 0.221 0.000 4.778 0.000 :4()\n", + " 1 0.000 0.000 13.653 13.653 :1()\n", + " 1 0.002 0.002 0.002 0.002 __init__.py:124(lrange)\n", + " 4 0.000 0.000 0.000 0.000 __init__.py:205(iteritems)\n", + " 1 0.000 0.000 0.000 0.000 _decorators.py:136(wrapper)\n", + " 1 0.000 0.000 0.000 0.000 _methods.py:42(_any)\n", + " 2 0.000 0.000 0.000 0.000 _methods.py:45(_all)\n", + " 197721 0.097 0.000 0.097 0.000 _weakrefset.py:70(__contains__)\n", + " 197719 0.118 0.000 0.215 0.000 abc.py:180(__instancecheck__)\n", + " 10 0.000 0.000 0.000 0.000 algorithms.py:1421(_get_take_nd_function)\n", + " 10 0.000 0.000 0.004 0.000 algorithms.py:1548(take_nd)\n", + " 1 0.000 0.000 13.603 13.603 apply.py:105(get_result)\n", + " 1 0.000 0.000 0.000 0.000 apply.py:14(frame_apply)\n", + " 1 0.001 0.001 13.603 13.603 apply.py:219(apply_standard)\n", + " 1 0.235 0.235 13.551 13.551 apply.py:253(apply_series_generator)\n", + " 1 0.000 0.000 0.050 0.050 apply.py:293(wrap_results)\n", + " 1 0.000 0.000 0.000 0.000 apply.py:34(__init__)\n", + " 1 0.000 0.000 0.195 0.195 apply.py:363(series_generator)\n", + " 98856 0.197 0.000 8.310 0.000 apply.py:366()\n", + " 1 0.000 0.000 0.000 0.000 apply.py:370(result_index)\n", + " 1 0.000 0.000 0.000 0.000 apply.py:374(result_columns)\n", + " 98857 0.056 0.000 0.056 0.000 apply.py:85(columns)\n", + " 2 0.000 0.000 0.000 0.000 apply.py:89(index)\n", + " 1 0.000 0.000 0.194 0.194 apply.py:93(values)\n", + " 1 0.000 0.000 
0.000 0.000 apply.py:97(dtypes)\n", + " 197710 0.099 0.000 0.367 0.000 base.py:1590(is_object)\n", + " 197710 0.209 0.000 0.611 0.000 base.py:1647(_convert_scalar_indexer)\n", + " 197712 0.161 0.000 0.196 0.000 base.py:2033(__contains__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2053(contains)\n", + " 2 0.000 0.000 0.000 0.000 base.py:2067(__getitem__)\n", + " 197710 0.145 0.000 0.708 0.000 base.py:2101(_can_hold_identifiers_and_holds_name)\n", + " 5/3 0.000 0.000 0.006 0.002 base.py:255(__new__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:3071(get_loc)\n", + " 197710 0.540 0.000 2.931 0.000 base.py:3090(get_value)\n", + " 1 0.000 0.000 0.000 0.000 base.py:4117(_maybe_cast_indexer)\n", + " 1 0.000 0.000 0.000 0.000 base.py:4355(insert)\n", + " 4 0.000 0.000 0.000 0.000 base.py:444()\n", + " 3 0.000 0.000 0.000 0.000 base.py:473(_simple_new)\n", + " 98862 0.036 0.000 0.056 0.000 base.py:4914(_ensure_index)\n", + " 1 0.000 0.000 0.000 0.000 base.py:520(_shallow_copy_with_infer)\n", + " 395751 0.201 0.000 0.307 0.000 base.py:61(is_dtype)\n", + " 3 0.000 0.000 0.000 0.000 base.py:635(_reset_identity)\n", + " 494294 0.153 0.000 0.206 0.000 base.py:641(__len__)\n", + " 3 0.000 0.000 0.000 0.000 base.py:647(__array__)\n", + " 17 0.000 0.000 0.000 0.000 base.py:672(values)\n", + " 7 0.000 0.000 0.000 0.000 base.py:677(_values)\n", + " 1 0.000 0.000 0.000 0.000 base.py:789(_ndarray_values)\n", + " 1 0.002 0.002 0.003 0.003 base.py:850(_try_convert_to_int_index)\n", + " 2 0.000 0.000 0.000 0.000 base.py:893(tolist)\n", + " 1 0.000 0.000 0.000 0.000 base.py:904(_coerce_to_ndarray)\n", + " 3 0.000 0.000 0.002 0.001 base.py:912(__iter__)\n", + " 2 0.000 0.000 0.000 0.000 base.py:920(_get_attributes_dict)\n", + " 2 0.000 0.000 0.000 0.000 base.py:922()\n", + " 1 0.000 0.000 0.000 0.000 base.py:936(_coerce_scalar_to_index)\n", + " 1 0.000 0.000 0.001 0.001 cast.py:1093(find_common_type)\n", + " 2 0.000 0.000 0.001 0.000 cast.py:1118()\n", + " 2 0.000 0.000 0.000 0.000 
cast.py:1121()\n", + " 2 0.001 0.001 0.001 0.001 cast.py:1207(construct_1d_object_array_from_listlike)\n", + " 98856 0.046 0.000 0.074 0.000 cast.py:1232(construct_1d_ndarray_preserving_na)\n", + " 9 0.000 0.000 0.000 0.000 cast.py:257(maybe_promote)\n", + " 1 0.000 0.000 0.003 0.003 cast.py:44(maybe_convert_platform)\n", + " 98858 0.095 0.000 0.095 0.000 cast.py:853(maybe_castable)\n", + " 98855 0.320 0.000 1.222 0.000 cast.py:867(maybe_infer_to_datetimelike)\n", + " 98857 0.354 0.000 1.810 0.000 cast.py:971(maybe_cast_to_datetime)\n", + " 9 0.000 0.000 0.004 0.000 categorical.py:1248(__array__)\n", + " 9 0.000 0.000 0.000 0.000 categorical.py:381(categories)\n", + " 27 0.000 0.000 0.000 0.000 categorical.py:425(dtype)\n", + " 4 0.000 0.000 0.000 0.000 common.py:1043(is_datetime64_any_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:1170(is_datetime_or_timedelta_dtype)\n", + " 197849 0.064 0.000 0.467 0.000 common.py:122(is_sparse)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1405(needs_i8_conversion)\n", + " 5 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)\n", + " 4 0.000 0.000 0.000 0.000 common.py:1578(is_bool_dtype)\n", + " 98988 0.093 0.000 0.899 0.000 common.py:1629(is_extension_type)\n", + " 98891 0.126 0.000 0.485 0.000 common.py:1688(is_extension_array_dtype)\n", + " 5 0.000 0.000 0.000 0.000 common.py:1784(_get_dtype)\n", + " 296604 0.217 0.000 0.320 0.000 common.py:1835(_get_dtype_type)\n", + " 197844 0.100 0.000 0.586 0.000 common.py:195(is_categorical)\n", + " 2 0.000 0.000 0.000 0.000 common.py:1965(pandas_dtype)\n", + " 197869 0.094 0.000 0.523 0.000 common.py:227(is_datetimetz)\n", + " 5 0.000 0.000 0.001 0.000 common.py:301(_asarray_tuplesafe)\n", + " 8 0.000 0.000 0.000 0.000 common.py:332(is_datetime64_dtype)\n", + " 197877 0.086 0.000 0.235 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 197711 0.058 0.000 0.079 0.000 common.py:395(_apply_if_callable)\n", + " 7 0.000 0.000 0.000 0.000 common.py:407(is_timedelta64_dtype)\n", + " 1 
0.000 0.000 0.000 0.000 common.py:444(is_period_dtype)\n", + " 21 0.000 0.000 0.000 0.000 common.py:477(is_interval_dtype)\n", + " 197858 0.096 0.000 0.254 0.000 common.py:513(is_categorical_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:546(is_string_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:647(is_datetimelike)\n", + " 2 0.000 0.000 0.001 0.000 common.py:692(is_dtype_equal)\n", + " 2 0.000 0.000 0.000 0.000 common.py:811(is_integer_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:858(is_signed_integer_dtype)\n", + " 395424 0.206 0.000 0.568 0.000 common.py:89(is_object_dtype)\n", + " 4 0.000 0.000 0.000 0.000 common.py:907(is_unsigned_integer_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:995(is_int_or_datetime_dtype)\n", + " 2 0.000 0.000 0.001 0.000 dtypes.py:172(__hash__)\n", + " 1 0.000 0.000 0.001 0.001 dtypes.py:183(__eq__)\n", + " 2 0.000 0.000 0.001 0.000 dtypes.py:227(_hash_categories)\n", + " 2 0.000 0.000 0.000 0.000 dtypes.py:241()\n", + " 4 0.000 0.000 0.000 0.000 dtypes.py:266(construct_from_string)\n", + " 16 0.000 0.000 0.000 0.000 dtypes.py:363(categories)\n", + " 5 0.000 0.000 0.000 0.000 dtypes.py:370(ordered)\n", + " 1 0.000 0.000 0.000 0.000 dtypes.py:584(is_dtype)\n", + " 2 0.000 0.000 0.000 0.000 dtypes.py:675(construct_from_string)\n", + " 18 0.000 0.000 0.000 0.000 dtypes.py:707(is_dtype)\n", + " 1 0.000 0.000 0.001 0.001 frame.py:3105(__setitem__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3165(_ensure_valid_index)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3182(_set_item)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3324(_sanitize_column)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3344(reindexer)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:555(shape)\n", + " 1 0.000 0.000 13.603 13.603 frame.py:5837(apply)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:844(__len__)\n", + " 2 0.000 0.000 0.000 0.000 fromnumeric.py:1471(ravel)\n", + " 1 0.000 0.000 0.000 0.000 function.py:38(__call__)\n", + " 2 0.000 0.000 0.000 0.000 
function_base.py:4476(append)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:1141(__invert__)\n", + " 98861 0.118 0.000 0.118 0.000 generic.py:124(__init__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:164(_validate_dtype)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2577(_clear_item_cache)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2599(_set_item)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2633(_check_setitem_copy)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:364(_get_axis_number)\n", + " 3 0.000 0.000 0.000 0.000 generic.py:4345(__finalize__)\n", + " 296571 0.425 0.000 4.647 0.000 generic.py:4362(__getattr__)\n", + " 98863 0.204 0.000 0.531 0.000 generic.py:4378(__setattr__)\n", + " 197711 0.076 0.000 0.136 0.000 generic.py:438(_info_axis)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4423(_protect_consolidate)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4433(_consolidate_inplace)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4436(f)\n", + " 1 0.000 0.000 0.194 0.194 generic.py:4563(values)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4765(dtypes)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4890(astype)\n", + " 1483499 0.447 0.000 0.971 0.000 generic.py:7(_check)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:9675(logical_func)\n", + " 2 0.000 0.000 0.000 0.000 hashing.py:23(_combine_hash_arrays)\n", + " 2 0.000 0.000 0.000 0.000 hashing.py:230(hash_array)\n", + " 1 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)\n", + " 4 0.000 0.000 0.000 0.000 inference.py:119(is_iterator)\n", + " 197719 0.099 0.000 0.464 0.000 inference.py:251(is_list_like)\n", + " 1 0.000 0.000 0.000 0.000 inference.py:364(is_dict_like)\n", + " 98855 0.039 0.000 0.059 0.000 inference.py:415(is_hashable)\n", + " 1 0.000 0.000 0.000 0.000 inference.py:447(is_sequence)\n", + " 1 0.000 0.000 0.000 0.000 inspect.py:73(isclass)\n", + " 98861 0.189 0.000 0.387 0.000 internals.py:116(__init__)\n", + " 98861 0.039 0.000 0.039 0.000 internals.py:127(_check_ndim)\n", + " 1 0.000 
0.000 0.000 0.000 internals.py:184(is_categorical_astype)\n", + " 9 0.000 0.000 0.005 0.001 internals.py:1937(get_values)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:199(external_values)\n", + " 197712 0.033 0.000 0.033 0.000 internals.py:203(internal_values)\n", + " 3 0.000 0.000 0.084 0.028 internals.py:213(get_values)\n", + " 197711 0.082 0.000 0.171 0.000 internals.py:222(to_dense)\n", + " 98857 0.151 0.000 0.553 0.000 internals.py:2298(__init__)\n", + " 98874 0.019 0.000 0.019 0.000 internals.py:233(mgr_locs)\n", + " 98861 0.086 0.000 0.102 0.000 internals.py:237(mgr_locs)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:269(make_block_same_class)\n", + " 98860 0.315 0.000 1.621 0.000 internals.py:3148(get_block_type)\n", + " 98861 0.138 0.000 2.312 0.000 internals.py:3191(make_block)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:3307(shape)\n", + " 9 0.000 0.000 0.000 0.000 internals.py:3309()\n", + " 3 0.000 0.000 0.000 0.000 internals.py:3311(ndim)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3315(set_axis)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3351(_is_single_block)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3404(get_dtypes)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3405()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3473(__len__)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3500(apply)\n", + " 197736 0.049 0.000 0.049 0.000 internals.py:352(dtype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3561()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3713(astype)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3789(is_mixed_type)\n", + " 1 0.000 0.000 0.194 0.194 internals.py:3922(as_array)\n", + " 1 0.074 0.074 0.194 0.194 internals.py:3953(_interleave)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4085(consolidate)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4101(_consolidate_inplace)\n", + " 1 
0.000 0.000 0.000 0.000 internals.py:4208(set)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4323(insert)\n", + " 98860 0.211 0.000 2.637 0.000 internals.py:4639(__init__)\n", + " 593135 0.118 0.000 0.118 0.000 internals.py:4684(_block)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4709(index)\n", + " 197711 0.114 0.000 0.203 0.000 internals.py:4718(dtype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4742(external_values)\n", + " 197712 0.128 0.000 0.209 0.000 internals.py:4745(internal_values)\n", + " 197711 0.173 0.000 0.430 0.000 internals.py:4752(get_values)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4774(_consolidate_inplace)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:5044(_interleaved_dtype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5048()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5101(_extend_blocks)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:573(astype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:577(_astype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5880(_fast_count_smallints)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:774(copy)\n", + " 2 0.000 0.000 0.000 0.000 missing.py:112(_isna_new)\n", + " 1 0.000 0.000 0.000 0.000 missing.py:189(_isna_ndarraylike)\n", + " 2 0.000 0.000 0.000 0.000 missing.py:32(isna)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:179(_get_fill_value)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:202(_get_values)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:256(_na_ok_dtype)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:260(_view_if_needed)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:318(nanany)\n", + " 5 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)\n", + " 4 0.000 0.000 0.000 0.000 numeric.py:2491(seterr)\n", + " 4 0.000 0.000 0.000 0.000 numeric.py:2592(geterr)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:2887(__init__)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:2891(__enter__)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:2896(__exit__)\n", + " 1 0.000 0.000 0.000 0.000 
numeric.py:35(__new__)\n", + " 27/18 0.000 0.000 0.006 0.000 numeric.py:433(asarray)\n", + " 5 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:94(zeros_like)\n", + " 4 0.000 0.000 0.000 0.000 numerictypes.py:619(issubclass_)\n", + " 2 0.000 0.000 0.000 0.000 numerictypes.py:687(issubdtype)\n", + " 1 0.000 0.000 0.002 0.002 range.py:257(tolist)\n", + " 1 0.000 0.000 0.000 0.000 range.py:315(equals)\n", + " 12 0.000 0.000 0.000 0.000 range.py:481(__len__)\n", + "98861/98860 0.629 0.000 8.107 0.000 series.py:166(__init__)\n", + " 1 0.040 0.040 0.049 0.049 series.py:284(_init_dict)\n", + " 1 0.000 0.000 0.001 0.001 series.py:3069(apply)\n", + " 1 0.000 0.000 0.000 0.000 series.py:3203(_reduce)\n", + " 3 0.000 0.000 0.000 0.000 series.py:349(_constructor)\n", + " 98862 0.102 0.000 0.143 0.000 series.py:365(_set_axis)\n", + " 98862 0.041 0.000 0.041 0.000 series.py:391(_set_subtyp)\n", + " 197719 0.124 0.000 0.214 0.000 series.py:401(name)\n", + " 98859 0.290 0.000 3.551 0.000 series.py:4019(_sanitize_array)\n", + " 98858 0.204 0.000 3.098 0.000 series.py:4036(_try_cast)\n", + " 98864 0.076 0.000 0.135 0.000 series.py:405(name)\n", + " 197711 0.097 0.000 0.301 0.000 series.py:412(dtype)\n", + " 1 0.000 0.000 0.000 0.000 series.py:432(values)\n", + " 197712 0.093 0.000 0.302 0.000 series.py:465(_values)\n", + " 197711 0.080 0.000 0.510 0.000 series.py:476(get_values)\n", + " 1 0.000 0.000 0.000 0.000 series.py:562(__len__)\n", + " 1 0.000 0.000 0.000 0.000 series.py:643(__array_wrap__)\n", + " 197710 0.340 0.000 3.379 0.000 series.py:764(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 shape_base.py:63(atleast_2d)\n", + " 3 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method _operator.inv}\n", + " 4 0.000 0.000 0.001 0.000 {built-in method builtins.all}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method builtins.any}\n", + " 197711 0.021 0.000 0.021 
0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 13.653 13.653 {built-in method builtins.exec}\n", + " 2571249 0.852 0.000 1.154 0.000 {built-in method builtins.getattr}\n", + " 395544 0.182 0.000 0.182 0.000 {built-in method builtins.hasattr}\n", + " 296570 0.055 0.000 0.056 0.000 {built-in method builtins.hash}\n", + " 4449733 1.121 0.000 2.308 0.000 {built-in method builtins.isinstance}\n", + " 1087532 0.139 0.000 0.139 0.000 {built-in method builtins.issubclass}\n", + " 10 0.000 0.000 0.000 0.000 {built-in method builtins.iter}\n", + "1482938/988641 0.308 0.000 0.461 0.000 {built-in method builtins.len}\n", + " 12 0.000 0.000 0.000 0.000 {built-in method builtins.max}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method builtins.next}\n", + "395457/395448 0.116 0.000 0.121 0.000 {built-in method numpy.core.multiarray.array}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.concatenate}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.copyto}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty_like}\n", + " 14 0.032 0.002 0.032 0.002 {built-in method numpy.core.multiarray.empty}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.putmask}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}\n", + " 8 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.geterrobj}\n", + " 4 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.seterrobj}\n", + " 10 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 98855 0.016 0.000 0.016 0.000 {built-in method pandas._libs.algos.ensure_object}\n", + " 98855 0.043 0.000 0.043 0.000 {built-in method pandas._libs.lib.infer_datetimelike_array}\n", + " 197720 0.029 0.000 0.029 0.000 {built-in method pandas._libs.lib.is_float}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 197717 0.029 0.000 0.029 0.000 {built-in method 
pandas._libs.lib.is_scalar}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}\n", + " 2 0.000 0.000 0.000 0.000 {method 'all' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 98857 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}\n", + " 4 0.085 0.021 0.085 0.021 {method 'astype' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'clear' of 'dict' objects}\n", + " 3 0.000 0.000 0.000 0.000 {method 'copy' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}\n", + " 12 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}\n", + " 197710 0.232 0.000 0.232 0.000 {method 'get_value' of 'pandas._libs.index.IndexEngine' objects}\n", + " 4 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}\n", + " 3 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}\n", + " 5 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 9 0.000 0.000 0.000 0.000 {method 'reshape' of 'numpy.ndarray' objects}\n", + " 6 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'tolist' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'transpose' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}\n", + " 197731 0.089 0.000 0.089 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 10 0.003 0.000 0.003 0.000 {pandas._libs.algos.take_1d_object_object}\n", + " 2 0.000 0.000 0.000 0.000 {pandas._libs.hashing.hash_object_array}\n", + " 2 0.001 0.000 0.001 0.000 {pandas._libs.lib.infer_dtype}\n", + " 1 0.000 0.000 0.001 0.001 
{pandas._libs.lib.map_infer}\n", + " 1 0.002 0.002 0.002 0.002 {pandas._libs.lib.maybe_convert_objects}\n", + " 395422 0.206 0.000 0.715 0.000 {pandas._libs.lib.values_from_object}\n", + "\n", + "\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 1088861 function calls (1088853 primitive calls) in 0.353 seconds\n", + "\n", + " Ordered by: standard name\n", + "\n", + " ncalls tottime percall cumtime percall filename:lineno(function)\n", + " 18 0.000 0.000 0.000 0.000 :997(_handle_fromlist)\n", + " 1 0.005 0.005 0.353 0.353 :7(after)\n", + " 1 0.000 0.000 0.353 0.353 :1()\n", + " 2 0.000 0.000 0.000 0.000 __init__.py:205(iteritems)\n", + " 3 0.000 0.000 0.000 0.000 _methods.py:42(_any)\n", + " 98862 0.024 0.000 0.024 0.000 _weakrefset.py:70(__contains__)\n", + " 98861 0.030 0.000 0.054 0.000 abc.py:180(__instancecheck__)\n", + " 1 0.000 0.000 0.000 0.000 accessor.py:129(__get__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:142(_freeze)\n", + " 3 0.000 0.000 0.000 0.000 base.py:147(__setattr__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1569(is_unique)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1935(_engine)\n", + " 1 0.000 0.000 0.000 0.000 base.py:1938()\n", + " 3 0.000 0.000 0.000 0.000 base.py:2033(__contains__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2053(contains)\n", + " 2 0.000 0.000 0.000 0.000 base.py:2067(__getitem__)\n", + " 1 0.000 0.000 0.000 0.000 base.py:2465(identical)\n", + " 2 0.000 0.000 0.000 0.000 base.py:2470()\n", + " 5 0.000 0.000 0.000 0.000 base.py:3071(get_loc)\n", + " 14 0.000 0.000 0.000 0.000 base.py:4914(_ensure_index)\n", + " 22 0.000 0.000 0.000 0.000 base.py:61(is_dtype)\n", + " 2 0.000 0.000 0.000 0.000 base.py:635(_reset_identity)\n", + " 2 0.000 0.000 0.000 0.000 base.py:641(__len__)\n", + " 2 0.000 0.000 0.000 0.000 base.py:672(values)\n", + " 1 0.000 0.000 0.000 0.000 base.py:677(_values)\n", + " 1 0.000 0.000 0.000 0.000 base.py:789(_ndarray_values)\n", + " 1 0.000 0.000 0.000 0.000 
base.py:924(view)\n", + " 1 0.000 0.000 0.000 0.000 cast.py:1232(construct_1d_ndarray_preserving_na)\n", + " 2 0.000 0.000 0.000 0.000 cast.py:853(maybe_castable)\n", + " 2 0.000 0.000 0.000 0.000 cast.py:867(maybe_infer_to_datetimelike)\n", + " 2 0.000 0.000 0.000 0.000 cast.py:971(maybe_cast_to_datetime)\n", + " 2 0.000 0.000 0.000 0.000 common.py:1170(is_datetime_or_timedelta_dtype)\n", + " 9 0.000 0.000 0.000 0.000 common.py:122(is_sparse)\n", + " 1 0.000 0.000 0.000 0.000 common.py:123(_default_index)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1405(needs_i8_conversion)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1490(is_string_like_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:154(_all_none)\n", + " 1 0.000 0.000 0.000 0.000 common.py:1578(is_bool_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:1629(is_extension_type)\n", + " 8 0.000 0.000 0.000 0.000 common.py:1688(is_extension_array_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:1784(_get_dtype)\n", + " 11 0.000 0.000 0.000 0.000 common.py:1835(_get_dtype_type)\n", + " 5 0.000 0.000 0.000 0.000 common.py:195(is_categorical)\n", + " 8 0.000 0.000 0.000 0.000 common.py:227(is_datetimetz)\n", + " 10 0.000 0.000 0.000 0.000 common.py:369(is_datetime64tz_dtype)\n", + " 3 0.000 0.000 0.000 0.000 common.py:395(_apply_if_callable)\n", + " 2 0.000 0.000 0.000 0.000 common.py:444(is_period_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:477(is_interval_dtype)\n", + " 9 0.000 0.000 0.000 0.000 common.py:513(is_categorical_dtype)\n", + " 2 0.000 0.000 0.000 0.000 common.py:546(is_string_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:811(is_integer_dtype)\n", + " 9 0.000 0.000 0.000 0.000 common.py:89(is_object_dtype)\n", + " 1 0.000 0.000 0.000 0.000 common.py:995(is_int_or_datetime_dtype)\n", + " 2 0.000 0.000 0.000 0.000 concat.py:105(_get_sliced_frame_result_type)\n", + " 2 0.000 0.000 0.000 0.000 dtypes.py:584(is_dtype)\n", + " 2 0.000 0.000 0.000 0.000 dtypes.py:707(is_dtype)\n", + " 2 0.000 
0.000 0.001 0.000 frame.py:2664(__getitem__)\n", + " 2 0.000 0.000 0.000 0.000 frame.py:2690(_getitem_column)\n", + " 2 0.000 0.000 0.000 0.000 frame.py:3093(_box_item_values)\n", + " 2 0.000 0.000 0.000 0.000 frame.py:3100(_box_col_values)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3105(__setitem__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3165(_ensure_valid_index)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3182(_set_item)\n", + " 3 0.000 0.000 0.000 0.000 frame.py:320(_constructor)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3324(_sanitize_column)\n", + " 3 0.000 0.000 0.005 0.002 frame.py:334(__init__)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3344(reindexer)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:3541(align)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:461(_init_ndarray)\n", + " 1 0.000 0.000 0.002 0.002 frame.py:4759(_combine_match_index)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:478(_get_axes)\n", + " 1 0.000 0.000 0.001 0.001 frame.py:6845(_reduce)\n", + " 1 0.000 0.000 0.001 0.001 frame.py:6856(f)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7047(_get_agg_axis)\n", + " 1 0.000 0.000 0.003 0.003 frame.py:7255(isin)\n", + " 1 0.000 0.000 0.001 0.001 frame.py:7349(_arrays_to_mgr)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7419(_prep_ndarray)\n", + " 1 0.000 0.000 0.004 0.004 frame.py:7453(_to_arrays)\n", + " 1 0.000 0.000 0.004 0.004 frame.py:7547(_list_to_arrays)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7604(_convert_object_array)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7615(convert)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7621()\n", + " 1 0.000 0.000 0.000 0.000 frame.py:7644(_homogenize)\n", + " 1 0.000 0.000 0.000 0.000 frame.py:844(__len__)\n", + " 1 0.000 0.000 0.000 0.000 function.py:38(__call__)\n", + " 7 0.000 0.000 0.000 0.000 generic.py:124(__init__)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:178(_init_mgr)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:2484(_get_item_cache)\n", + " 2 0.000 0.000 0.000 0.000 
generic.py:2498(_set_as_cached)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2577(_clear_item_cache)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2599(_set_item)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:2633(_check_setitem_copy)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:297(_construct_axes_dict)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:299()\n", + " 1 0.000 0.000 0.001 0.001 generic.py:3058(reindex_like)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:317(_construct_axes_from_arguments)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:349()\n", + " 3 0.000 0.000 0.000 0.000 generic.py:364(_get_axis_number)\n", + " 1 0.000 0.000 0.001 0.001 generic.py:3647(reindex)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:3674()\n", + " 2 0.000 0.000 0.000 0.000 generic.py:377(_get_axis_name)\n", + " 2 0.000 0.000 0.000 0.000 generic.py:390(_get_axis)\n", + " 3 0.000 0.000 0.000 0.000 generic.py:4345(__finalize__)\n", + " 4 0.000 0.000 0.000 0.000 generic.py:4362(__getattr__)\n", + " 11 0.000 0.000 0.000 0.000 generic.py:4378(__setattr__)\n", + " 4 0.000 0.000 0.000 0.000 generic.py:4423(_protect_consolidate)\n", + " 3 0.000 0.000 0.000 0.000 generic.py:4433(_consolidate_inplace)\n", + " 3 0.000 0.000 0.000 0.000 generic.py:4436(f)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4475(_is_mixed_type)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:4477()\n", + " 2 0.000 0.000 0.000 0.000 generic.py:4563(values)\n", + " 1 0.000 0.000 0.001 0.001 generic.py:5009(copy)\n", + " 67 0.000 0.000 0.000 0.000 generic.py:7(_check)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:7332(align)\n", + " 1 0.000 0.000 0.000 0.000 generic.py:7423(_align_series)\n", + " 1 0.000 0.000 0.001 0.001 generic.py:9675(logical_func)\n", + " 1 0.000 0.000 0.000 0.000 indexing.py:2321(convert_to_index_sliceable)\n", + " 98861 0.035 0.000 0.133 0.000 inference.py:251(is_list_like)\n", + " 1 0.000 0.000 0.000 0.000 inference.py:388(is_named_tuple)\n", + " 4 0.000 0.000 0.000 0.000 inference.py:415(is_hashable)\n", 
+ " 6 0.000 0.000 0.000 0.000 internals.py:116(__init__)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:127(_check_ndim)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:199(external_values)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:203(internal_values)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:213(get_values)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:2278(should_store)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:2298(__init__)\n", + " 15 0.000 0.000 0.000 0.000 internals.py:233(mgr_locs)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:237(mgr_locs)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:269(make_block_same_class)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:3148(get_block_type)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:3191(make_block)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3265(__init__)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3266()\n", + " 7 0.000 0.000 0.000 0.000 internals.py:3307(shape)\n", + " 21 0.000 0.000 0.000 0.000 internals.py:3309()\n", + " 5 0.000 0.000 0.000 0.000 internals.py:3311(ndim)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3351(_is_single_block)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)\n", + " 11 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:3473(__len__)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:348(shape)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3488(_verify_integrity)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:3490()\n", + " 1 0.000 0.000 0.001 0.001 internals.py:3500(apply)\n", + " 5 0.000 0.000 0.000 0.000 internals.py:352(dtype)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:356(ftype)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3561()\n", + " 2 0.000 0.000 0.000 0.000 internals.py:372(iget)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:375(set)\n", + " 5 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)\n", + " 2 0.000 0.000 0.000 0.000 
internals.py:3784(_consolidate_check)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3785()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3789(is_mixed_type)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:3895(copy)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3915()\n", + " 1 0.000 0.000 0.000 0.000 internals.py:3916()\n", + " 2 0.000 0.000 0.000 0.000 internals.py:3922(as_array)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:4085(consolidate)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:4101(_consolidate_inplace)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4108(get)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4137(iget)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4208(set)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4235(value_getitem)\n", + " 4 0.000 0.000 0.000 0.000 internals.py:4639(__init__)\n", + " 6 0.000 0.000 0.000 0.000 internals.py:4684(_block)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4709(index)\n", + " 3 0.000 0.000 0.000 0.000 internals.py:4718(dtype)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4742(external_values)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4745(internal_values)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4768(is_consolidated)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:4774(_consolidate_inplace)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:4846(create_block_manager_from_blocks)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:4869(create_block_manager_from_arrays)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:4880(form_blocks)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:4972(_simple_blockify)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:5017(_stack_arrays)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5020(_asarray_compat)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5026(_shape_compat)\n", + " 1 0.000 0.000 0.000 0.000 internals.py:5101(_extend_blocks)\n", + " 2 0.000 0.000 0.000 0.000 internals.py:5208(_get_blkno_placements)\n", + " 1 0.000 0.000 0.001 0.001 internals.py:774(copy)\n", 
+ " 5 0.000 0.000 0.009 0.002 missing.py:112(_isna_new)\n", + " 2 0.002 0.001 0.009 0.004 missing.py:189(_isna_ndarraylike)\n", + " 1 0.000 0.000 0.000 0.000 missing.py:259(notna)\n", + " 5 0.000 0.000 0.009 0.002 missing.py:32(isna)\n", + " 1 0.000 0.000 0.000 0.000 missing.py:596(clean_reindex_fill_method)\n", + " 2 0.000 0.000 0.000 0.000 missing.py:74(clean_fill_method)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:179(_get_fill_value)\n", + " 1 0.000 0.000 0.001 0.001 nanops.py:202(_get_values)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:256(_na_ok_dtype)\n", + " 1 0.000 0.000 0.000 0.000 nanops.py:260(_view_if_needed)\n", + " 1 0.000 0.000 0.001 0.001 nanops.py:318(nanany)\n", + " 4 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:2491(seterr)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:2592(geterr)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:2887(__init__)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:2891(__enter__)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:2896(__exit__)\n", + " 3 0.000 0.000 0.000 0.000 numeric.py:433(asarray)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:504(asanyarray)\n", + " 1 0.000 0.000 0.000 0.000 numeric.py:630(require)\n", + " 2 0.000 0.000 0.000 0.000 numeric.py:701()\n", + " 1 0.000 0.000 0.002 0.002 ops.py:1397(_combine_series_frame)\n", + " 1 0.000 0.000 0.000 0.000 ops.py:1442(_align_method_FRAME)\n", + " 1 0.000 0.000 0.002 0.002 ops.py:1571(na_op)\n", + " 1 0.000 0.000 0.002 0.002 ops.py:1579(f)\n", + " 2 0.000 0.000 0.000 0.000 range.py:131(_simple_new)\n", + " 1 0.000 0.000 0.000 0.000 range.py:158(_validate_dtype)\n", + " 1 0.000 0.000 0.000 0.000 range.py:177(_get_data_as_items)\n", + " 1 0.000 0.000 0.000 0.000 range.py:236(dtype)\n", + " 1 0.000 0.000 0.000 0.000 range.py:240(is_unique)\n", + " 1 0.000 0.000 0.000 0.000 range.py:260(_shallow_copy)\n", + " 4 0.000 0.000 0.000 0.000 range.py:315(equals)\n", + " 35 0.000 0.000 0.000 0.000 range.py:481(__len__)\n", + " 1 
0.000 0.000 0.000 0.000 range.py:491(__getitem__)\n", + " 2 0.000 0.000 0.000 0.000 range.py:68(__new__)\n", + " 2 0.000 0.000 0.000 0.000 range.py:84(_ensure_int)\n", + " 4 0.000 0.000 0.000 0.000 series.py:166(__init__)\n", + " 1 0.000 0.000 0.001 0.001 series.py:3323(reindex)\n", + " 1 0.000 0.000 0.000 0.000 series.py:349(_constructor)\n", + " 1 0.000 0.000 0.000 0.000 series.py:353(_constructor_expanddim)\n", + " 4 0.000 0.000 0.000 0.000 series.py:365(_set_axis)\n", + " 4 0.000 0.000 0.000 0.000 series.py:391(_set_subtyp)\n", + " 6 0.000 0.000 0.000 0.000 series.py:401(name)\n", + " 2 0.000 0.000 0.000 0.000 series.py:4019(_sanitize_array)\n", + " 2 0.000 0.000 0.000 0.000 series.py:4036(_try_cast)\n", + " 6 0.000 0.000 0.000 0.000 series.py:405(name)\n", + " 3 0.000 0.000 0.000 0.000 series.py:412(dtype)\n", + " 2 0.000 0.000 0.000 0.000 series.py:432(values)\n", + " 1 0.000 0.000 0.000 0.000 series.py:465(_values)\n", + " 1 0.000 0.000 0.000 0.000 series.py:562(__len__)\n", + " 1 0.000 0.000 0.000 0.000 shape_base.py:63(atleast_2d)\n", + " 1 0.000 0.000 0.119 0.119 strings.py:1345(str_split)\n", + " 98855 0.014 0.000 0.096 0.000 strings.py:1456()\n", + " 1 0.000 0.000 0.119 0.119 strings.py:148(_na_map)\n", + " 1 0.000 0.000 0.119 0.119 strings.py:153(_map)\n", + " 1 0.000 0.000 0.000 0.000 strings.py:1894(__init__)\n", + " 1 0.000 0.000 0.000 0.000 strings.py:1904(_validate)\n", + " 1 0.001 0.001 0.222 0.222 strings.py:1953(_wrap_result)\n", + " 98855 0.022 0.000 0.155 0.000 strings.py:1978(cons_row)\n", + " 1 0.016 0.016 0.172 0.172 strings.py:1984()\n", + " 98856 0.014 0.000 0.018 0.000 strings.py:1987()\n", + " 1 0.017 0.017 0.021 0.021 strings.py:1988()\n", + " 1 0.001 0.001 0.342 0.342 strings.py:2328(split)\n", + " 2 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x9cff80}\n", + " 1 0.002 0.002 0.002 0.002 {built-in method _operator.eq}\n", + " 3/2 0.000 0.000 0.000 0.000 {built-in method builtins.all}\n", + " 3 0.000 0.000 0.000 
0.000 {built-in method builtins.callable}\n", + " 1 0.000 0.000 0.353 0.353 {built-in method builtins.exec}\n", + " 107 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}\n", + " 31 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}\n", + " 8 0.000 0.000 0.000 0.000 {built-in method builtins.hash}\n", + " 197972 0.044 0.000 0.098 0.000 {built-in method builtins.isinstance}\n", + " 41 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method builtins.iter}\n", + "197830/197823 0.008 0.000 0.008 0.000 {built-in method builtins.len}\n", + " 36 0.007 0.000 0.025 0.001 {built-in method builtins.max}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method builtins.sum}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.arange}\n", + " 8 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}\n", + " 6 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.putmask}\n", + " 4 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.geterrobj}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.seterrobj}\n", + " 1 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_int64}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}\n", + " 2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.infer_datetimelike_array}\n", + " 5 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_integer}\n", + " 11 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.is_scalar}\n", + " 3 0.000 0.000 0.000 0.000 {built-in method pandas._libs.missing.checknull}\n", + " 3 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}\n", + " 4 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'clear' of 'dict' objects}\n", + " 4 0.001 0.000 0.001 0.000 {method 'copy' of 
'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects}\n", + " 4 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}\n", + " 8 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}\n", + " 5 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}\n", + " 4 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}\n", + " 9 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}\n", + " 3 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'reshape' of 'numpy.ndarray' objects}\n", + " 98855 0.081 0.000 0.081 0.000 {method 'split' of 'str' objects}\n", + " 2 0.000 0.000 0.000 0.000 {method 'transpose' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}\n", + " 1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}\n", + " 3 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.internals.get_blkno_indexers}\n", + " 1 0.016 0.016 0.112 0.112 {pandas._libs.lib.map_infer_mask}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.lib.maybe_convert_objects}\n", + " 1 0.003 0.003 0.003 0.003 {pandas._libs.lib.to_object_array}\n", + " 1 0.000 0.000 0.000 0.000 {pandas._libs.lib.values_from_object}\n", + " 1 0.007 0.007 0.007 0.007 {pandas._libs.missing.isnaobj}\n", + "\n", + "\n" + ] + } + ], + "source": [ + "import cProfile\n", + "\n", + "def before(df):\n", + " df[\"exists\"] = ~df.apply(lambda x: x.LanguageDesireNextYear in x.LanguageWorkedWith, axis=\"columns\")\n", + "\n", + "\n", + "def after(df):\n", + " df_split = df[\"LanguageDesireNextYear\"].str.split(\",\", expand=True)\n", + " 
df[\"exists\"] = df_split.isin(df[\"LanguageWorkedWith\"]).any(axis=1)\n", + "\n", + "df.LanguageDesireNextYear = df.LanguageDesireNextYear.fillna('missing')\n", + "df.LanguageWorkedWith = df.LanguageWorkedWith.fillna('missing')\n", + "cProfile.run(\"before(df)\")\n", + "cProfile.run(\"after(df)\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7.59 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" + ] + } + ], + "source": [ + "%timeit before(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "147 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" + ] + } + ], + "source": [ + "%timeit after(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Do tests\n", + "\n", + "[For loops with pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 0%| | 0/15 [00:00" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import perfplot\n", + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "perfplot.show(\n", + " setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),\n", + " kernels=[\n", + " lambda df: df[df.A != df.B],\n", + " lambda df: df.query('A != B'),\n", + " lambda df: df[[x != y for x, y in zip(df.A, df.B)]]\n", + " ],\n", + " labels=['vectorized !=', 'query (numexpr)', 'list comp'],\n", + " n_range=[2**k for k in range(0, 15)],\n", + " xlabel='N'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": 
[] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_Crosstab_-_cross_tabulation_of_two_factors_examples.ipynb b/notebooks/pandas/Pandas_Crosstab_-_cross_tabulation_of_two_factors_examples.ipynb new file mode 100644 index 0000000..dfe2ce4 --- /dev/null +++ b/notebooks/pandas/Pandas_Crosstab_-_cross_tabulation_of_two_factors_examples.ipynb @@ -0,0 +1,2380 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas: Crosstab - cross tabulation of two (or more) factors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Resources\n", + "\n", + "* [pandas.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html)\n", + "* [Pivot table](https://en.wikipedia.org/wiki/Pivot_table)\n", + "* [IMDB 5000 Movie Dataset](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset)\n", + "\n", + "## Official Pandas doc\n", + "\n", + "> Compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.\n", + "\n", + "## Pivot Table\n", + "\n", + "> A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.\n", + "\n", + "> Pivot tables are a technique in data processing." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use cases\n", + "\n", + "* Data summary\n", + "* Data aggregation\n", + "* Grouping\n", + "* Quick Reports\n", + "* Data patterns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Import Pandas and read data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"../csv/movie_metadata.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Select data for the crosstab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
, color, director_name, num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_2_name, actor_1_facebook_likes, gross, genres, ..., num_user_for_reviews, language, country, content_rating, budget, title_year, actor_2_facebook_likes, imdb_score, aspect_ratio, movie_facebook_likes
0, Color, James Cameron, 723.0, 178.0, 0.0, 855.0, Joel David Moore, 1000.0, 760505847.0, Action|Adventure|Fantasy|Sci-Fi, ..., 3054.0, English, USA, PG-13, 237000000.0, 2009.0, 936.0, 7.9, 1.78, 33000
1, Color, Gore Verbinski, 302.0, 169.0, 563.0, 1000.0, Orlando Bloom, 40000.0, 309404152.0, Action|Adventure|Fantasy, ..., 1238.0, English, USA, PG-13, 300000000.0, 2007.0, 5000.0, 7.1, 2.35, 0
2, Color, Sam Mendes, 602.0, 148.0, 0.0, 161.0, Rory Kinnear, 11000.0, 200074175.0, Action|Adventure|Thriller, ..., 994.0, English, UK, PG-13, 245000000.0, 2015.0, 393.0, 6.8, 2.35, 85000
3, Color, Christopher Nolan, 813.0, 164.0, 22000.0, 23000.0, Christian Bale, 27000.0, 448130642.0, Action|Thriller, ..., 2701.0, English, USA, PG-13, 250000000.0, 2012.0, 23000.0, 8.5, 2.35, 164000
4, NaN, Doug Walker, NaN, NaN, 131.0, NaN, Rob Walker, 131.0, NaN, Documentary, ..., NaN, NaN, NaN, NaN, NaN, NaN, 12.0, 7.1, NaN, 0
\n", + "

5 rows × 28 columns

\n", + "
" + ], + "text/plain": [ + " color director_name num_critic_for_reviews duration \\\n", + "0 Color James Cameron 723.0 178.0 \n", + "1 Color Gore Verbinski 302.0 169.0 \n", + "2 Color Sam Mendes 602.0 148.0 \n", + "3 Color Christopher Nolan 813.0 164.0 \n", + "4 NaN Doug Walker NaN NaN \n", + "\n", + " director_facebook_likes actor_3_facebook_likes actor_2_name \\\n", + "0 0.0 855.0 Joel David Moore \n", + "1 563.0 1000.0 Orlando Bloom \n", + "2 0.0 161.0 Rory Kinnear \n", + "3 22000.0 23000.0 Christian Bale \n", + "4 131.0 NaN Rob Walker \n", + "\n", + " actor_1_facebook_likes gross genres ... \\\n", + "0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... \n", + "1 40000.0 309404152.0 Action|Adventure|Fantasy ... \n", + "2 11000.0 200074175.0 Action|Adventure|Thriller ... \n", + "3 27000.0 448130642.0 Action|Thriller ... \n", + "4 131.0 NaN Documentary ... \n", + "\n", + " num_user_for_reviews language country content_rating budget \\\n", + "0 3054.0 English USA PG-13 237000000.0 \n", + "1 1238.0 English USA PG-13 300000000.0 \n", + "2 994.0 English UK PG-13 245000000.0 \n", + "3 2701.0 English USA PG-13 250000000.0 \n", + "4 NaN NaN NaN NaN NaN \n", + "\n", + " title_year actor_2_facebook_likes imdb_score aspect_ratio \\\n", + "0 2009.0 936.0 7.9 1.78 \n", + "1 2007.0 5000.0 7.1 2.35 \n", + "2 2015.0 393.0 6.8 2.35 \n", + "3 2012.0 23000.0 8.5 2.35 \n", + "4 NaN 12.0 7.1 NaN \n", + "\n", + " movie_facebook_likes \n", + "0 33000 \n", + "1 0 \n", + "2 85000 \n", + "3 164000 \n", + "4 0 \n", + "\n", + "[5 rows x 28 columns]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
, 0, 1, 2, 3, 4
color, Color, Color, Color, Color, NaN
director_name, James Cameron, Gore Verbinski, Sam Mendes, Christopher Nolan, Doug Walker
num_critic_for_reviews, 723, 302, 602, 813, NaN
duration, 178, 169, 148, 164, NaN
director_facebook_likes, 0, 563, 0, 22000, 131
actor_3_facebook_likes, 855, 1000, 161, 23000, NaN
actor_2_name, Joel David Moore, Orlando Bloom, Rory Kinnear, Christian Bale, Rob Walker
actor_1_facebook_likes, 1000, 40000, 11000, 27000, 131
gross, 7.60506e+08, 3.09404e+08, 2.00074e+08, 4.48131e+08, NaN
genres, Action|Adventure|Fantasy|Sci-Fi, Action|Adventure|Fantasy, Action|Adventure|Thriller, Action|Thriller, Documentary
actor_1_name, CCH Pounder, Johnny Depp, Christoph Waltz, Tom Hardy, Doug Walker
movie_title, Avatar, Pirates of the Caribbean: At World's End, Spectre, The Dark Knight Rises, Star Wars: Episode VII - The Force Awakens ...
num_voted_users, 886204, 471220, 275868, 1144337, 8
cast_total_facebook_likes, 4834, 48350, 11700, 106759, 143
actor_3_name, Wes Studi, Jack Davenport, Stephanie Sigman, Joseph Gordon-Levitt, NaN
facenumber_in_poster, 0, 0, 1, 0, 0
plot_keywords, avatar|future|marine|native|paraplegic, goddess|marriage ceremony|marriage proposal|pi..., bomb|espionage|sequel|spy|terrorist, deception|imprisonment|lawlessness|police offi..., NaN
movie_imdb_link, http://www.imdb.com/title/tt0499549/?ref_=fn_t..., http://www.imdb.com/title/tt0449088/?ref_=fn_t..., http://www.imdb.com/title/tt2379713/?ref_=fn_t..., http://www.imdb.com/title/tt1345836/?ref_=fn_t..., http://www.imdb.com/title/tt5289954/?ref_=fn_t...
num_user_for_reviews, 3054, 1238, 994, 2701, NaN
language, English, English, English, English, NaN
country, USA, USA, UK, USA, NaN
content_rating, PG-13, PG-13, PG-13, PG-13, NaN
budget, 2.37e+08, 3e+08, 2.45e+08, 2.5e+08, NaN
title_year, 2009, 2007, 2015, 2012, NaN
actor_2_facebook_likes, 936, 5000, 393, 23000, 12
imdb_score, 7.9, 7.1, 6.8, 8.5, 7.1
aspect_ratio, 1.78, 2.35, 2.35, 2.35, NaN
movie_facebook_likes, 33000, 0, 85000, 164000, 0
\n", + "
" + ], + "text/plain": [ + " 0 \\\n", + "color Color \n", + "director_name James Cameron \n", + "num_critic_for_reviews 723 \n", + "duration 178 \n", + "director_facebook_likes 0 \n", + "actor_3_facebook_likes 855 \n", + "actor_2_name Joel David Moore \n", + "actor_1_facebook_likes 1000 \n", + "gross 7.60506e+08 \n", + "genres Action|Adventure|Fantasy|Sci-Fi \n", + "actor_1_name CCH Pounder \n", + "movie_title Avatar  \n", + "num_voted_users 886204 \n", + "cast_total_facebook_likes 4834 \n", + "actor_3_name Wes Studi \n", + "facenumber_in_poster 0 \n", + "plot_keywords avatar|future|marine|native|paraplegic \n", + "movie_imdb_link http://www.imdb.com/title/tt0499549/?ref_=fn_t... \n", + "num_user_for_reviews 3054 \n", + "language English \n", + "country USA \n", + "content_rating PG-13 \n", + "budget 2.37e+08 \n", + "title_year 2009 \n", + "actor_2_facebook_likes 936 \n", + "imdb_score 7.9 \n", + "aspect_ratio 1.78 \n", + "movie_facebook_likes 33000 \n", + "\n", + " 1 \\\n", + "color Color \n", + "director_name Gore Verbinski \n", + "num_critic_for_reviews 302 \n", + "duration 169 \n", + "director_facebook_likes 563 \n", + "actor_3_facebook_likes 1000 \n", + "actor_2_name Orlando Bloom \n", + "actor_1_facebook_likes 40000 \n", + "gross 3.09404e+08 \n", + "genres Action|Adventure|Fantasy \n", + "actor_1_name Johnny Depp \n", + "movie_title Pirates of the Caribbean: At World's End  \n", + "num_voted_users 471220 \n", + "cast_total_facebook_likes 48350 \n", + "actor_3_name Jack Davenport \n", + "facenumber_in_poster 0 \n", + "plot_keywords goddess|marriage ceremony|marriage proposal|pi... \n", + "movie_imdb_link http://www.imdb.com/title/tt0449088/?ref_=fn_t... 
\n", + "num_user_for_reviews 1238 \n", + "language English \n", + "country USA \n", + "content_rating PG-13 \n", + "budget 3e+08 \n", + "title_year 2007 \n", + "actor_2_facebook_likes 5000 \n", + "imdb_score 7.1 \n", + "aspect_ratio 2.35 \n", + "movie_facebook_likes 0 \n", + "\n", + " 2 \\\n", + "color Color \n", + "director_name Sam Mendes \n", + "num_critic_for_reviews 602 \n", + "duration 148 \n", + "director_facebook_likes 0 \n", + "actor_3_facebook_likes 161 \n", + "actor_2_name Rory Kinnear \n", + "actor_1_facebook_likes 11000 \n", + "gross 2.00074e+08 \n", + "genres Action|Adventure|Thriller \n", + "actor_1_name Christoph Waltz \n", + "movie_title Spectre  \n", + "num_voted_users 275868 \n", + "cast_total_facebook_likes 11700 \n", + "actor_3_name Stephanie Sigman \n", + "facenumber_in_poster 1 \n", + "plot_keywords bomb|espionage|sequel|spy|terrorist \n", + "movie_imdb_link http://www.imdb.com/title/tt2379713/?ref_=fn_t... \n", + "num_user_for_reviews 994 \n", + "language English \n", + "country UK \n", + "content_rating PG-13 \n", + "budget 2.45e+08 \n", + "title_year 2015 \n", + "actor_2_facebook_likes 393 \n", + "imdb_score 6.8 \n", + "aspect_ratio 2.35 \n", + "movie_facebook_likes 85000 \n", + "\n", + " 3 \\\n", + "color Color \n", + "director_name Christopher Nolan \n", + "num_critic_for_reviews 813 \n", + "duration 164 \n", + "director_facebook_likes 22000 \n", + "actor_3_facebook_likes 23000 \n", + "actor_2_name Christian Bale \n", + "actor_1_facebook_likes 27000 \n", + "gross 4.48131e+08 \n", + "genres Action|Thriller \n", + "actor_1_name Tom Hardy \n", + "movie_title The Dark Knight Rises  \n", + "num_voted_users 1144337 \n", + "cast_total_facebook_likes 106759 \n", + "actor_3_name Joseph Gordon-Levitt \n", + "facenumber_in_poster 0 \n", + "plot_keywords deception|imprisonment|lawlessness|police offi... \n", + "movie_imdb_link http://www.imdb.com/title/tt1345836/?ref_=fn_t... 
\n", + "num_user_for_reviews 2701 \n", + "language English \n", + "country USA \n", + "content_rating PG-13 \n", + "budget 2.5e+08 \n", + "title_year 2012 \n", + "actor_2_facebook_likes 23000 \n", + "imdb_score 8.5 \n", + "aspect_ratio 2.35 \n", + "movie_facebook_likes 164000 \n", + "\n", + " 4 \n", + "color NaN \n", + "director_name Doug Walker \n", + "num_critic_for_reviews NaN \n", + "duration NaN \n", + "director_facebook_likes 131 \n", + "actor_3_facebook_likes NaN \n", + "actor_2_name Rob Walker \n", + "actor_1_facebook_likes 131 \n", + "gross NaN \n", + "genres Documentary \n", + "actor_1_name Doug Walker \n", + "movie_title Star Wars: Episode VII - The Force Awakens  ... \n", + "num_voted_users 8 \n", + "cast_total_facebook_likes 143 \n", + "actor_3_name NaN \n", + "facenumber_in_poster 0 \n", + "plot_keywords NaN \n", + "movie_imdb_link http://www.imdb.com/title/tt5289954/?ref_=fn_t... \n", + "num_user_for_reviews NaN \n", + "language NaN \n", + "country NaN \n", + "content_rating NaN \n", + "budget NaN \n", + "title_year NaN \n", + "actor_2_facebook_likes 12 \n", + "imdb_score 7.1 \n", + "aspect_ratio NaN \n", + "movie_facebook_likes 0 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head().T" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',\n", + " 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',\n", + " 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',\n", + " 'movie_title', 'num_voted_users', 'cast_total_facebook_likes',\n", + " 'actor_3_name', 'facenumber_in_poster', 'plot_keywords',\n", + " 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',\n", + " 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',\n", + " 'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],\n", + " 
dtype='object')" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df2 = df.iloc[[2, 4, 9, 12, 13, 14, 20, 23, 25, 30, 34, 50, 79], :]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Step 3: Create crosstab table" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director_name \n", + "Baz Luhrmann 1 0 0 0 0\n", + "Brett Ratner 0 1 0 0 0\n", + "David Yates 0 0 0 1 0\n", + "Gore Verbinski 0 0 0 0 2\n", + "Jon Favreau 0 0 0 1 0\n", + "Marc Forster 0 0 0 1 0\n", + "Peter Jackson 0 0 2 0 1\n", + "Sam Mendes 0 0 0 2 0" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# simple usage\n", + "pd.crosstab(df2['director_name'], df2['country'])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director \n", + "Baz Luhrmann 1 0 0 0 0\n", + "Brett Ratner 0 1 0 0 0\n", + "David Yates 0 0 0 1 0\n", + "Gore Verbinski 0 0 0 0 2\n", + "Jon Favreau 0 0 0 1 0\n", + "Marc Forster 0 0 0 1 0\n", + "Peter Jackson 0 0 2 0 1\n", + "Sam Mendes 0 0 0 2 0" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# change row and column names\n", + "pd.crosstab(df2['director_name'], df2['country'], rownames=['director'], colnames=['country'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Crosstab: normalize or show percentage per row or total" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director_name \n", + "Baz Luhrmann 0.083333 0.000000 0.000000 0.000000 0.000000\n", + "Brett Ratner 0.000000 0.083333 0.000000 0.000000 0.000000\n", + "David Yates 0.000000 0.000000 0.000000 0.083333 0.000000\n", + "Gore Verbinski 0.000000 0.000000 0.000000 0.000000 0.166667\n", + "Jon Favreau 0.000000 0.000000 0.000000 0.083333 0.000000\n", + "Marc Forster 0.000000 0.000000 0.000000 0.083333 0.000000\n", + "Peter Jackson 0.000000 0.000000 0.166667 0.000000 0.083333\n", + "Sam Mendes 0.000000 0.000000 0.000000 0.166667 0.000000" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show percentage - global - normalize=True\n", + "pd.crosstab(df2['director_name'], df2['country'], normalize=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director_name \n", + "Baz Luhrmann 1.0 0.0 0.000000 0.0 0.000000\n", + "Brett Ratner 0.0 1.0 0.000000 0.0 0.000000\n", + "David Yates 0.0 0.0 0.000000 1.0 0.000000\n", + "Gore Verbinski 0.0 0.0 0.000000 0.0 1.000000\n", + "Jon Favreau 0.0 0.0 0.000000 1.0 0.000000\n", + "Marc Forster 0.0 0.0 0.000000 1.0 0.000000\n", + "Peter Jackson 0.0 0.0 0.666667 0.0 0.333333\n", + "Sam Mendes 0.0 0.0 0.000000 1.0 0.000000" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show percentage - per index - normalize='index'\n", + "pd.crosstab(df2['director_name'], df2['country'], normalize='index')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA All\n", + "director_name \n", + "Baz Luhrmann 1 0 0 0 0 1\n", + "Brett Ratner 0 1 0 0 0 1\n", + "David Yates 0 0 0 1 0 1\n", + "Gore Verbinski 0 0 0 0 2 2\n", + "Jon Favreau 0 0 0 1 0 1\n", + "Marc Forster 0 0 0 1 0 1\n", + "Peter Jackson 0 0 2 0 1 3\n", + "Sam Mendes 0 0 0 2 0 2\n", + "All 1 1 2 5 3 12" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Show total - margins=True\n", + "pd.crosstab(df2['director_name'], df2['country'], margins=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA All\n", + "director_name \n", + "Baz Luhrmann 0.083333 0.000000 0.000000 0.000000 0.000000 0.083333\n", + "Brett Ratner 0.000000 0.083333 0.000000 0.000000 0.000000 0.083333\n", + "David Yates 0.000000 0.000000 0.000000 0.083333 0.000000 0.083333\n", + "Gore Verbinski 0.000000 0.000000 0.000000 0.000000 0.166667 0.166667\n", + "Jon Favreau 0.000000 0.000000 0.000000 0.083333 0.000000 0.083333\n", + "Marc Forster 0.000000 0.000000 0.000000 0.083333 0.000000 0.083333\n", + "Peter Jackson 0.000000 0.000000 0.166667 0.000000 0.083333 0.250000\n", + "Sam Mendes 0.000000 0.000000 0.000000 0.166667 0.000000 0.166667\n", + "All 0.083333 0.083333 0.166667 0.416667 0.250000 1.000000" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Combining totals and percentage\n", + "pd.crosstab(df2['director_name'], df2['country'], margins=True, normalize=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director_name \n", + "Baz Luhrmann 1.000000 0.000000 0.000000 0.000000 0.000000\n", + "Brett Ratner 0.000000 1.000000 0.000000 0.000000 0.000000\n", + "David Yates 0.000000 0.000000 0.000000 1.000000 0.000000\n", + "Gore Verbinski 0.000000 0.000000 0.000000 0.000000 1.000000\n", + "Jon Favreau 0.000000 0.000000 0.000000 1.000000 0.000000\n", + "Marc Forster 0.000000 0.000000 0.000000 1.000000 0.000000\n", + "Peter Jackson 0.000000 0.000000 0.666667 0.000000 0.333333\n", + "Sam Mendes 0.000000 0.000000 0.000000 1.000000 0.000000\n", + "All 0.083333 0.083333 0.166667 0.416667 0.250000" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Combining totals and percentage per row\n", + "pd.crosstab(df2['director_name'], df2['country'], margins=True, normalize='index')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas crosstab multiple columns" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada \\\n", + "director_name genres \n", + "Baz Luhrmann Drama|Romance 1 0 \n", + "Brett Ratner Action|Adventure|Fantasy|Sci-Fi|Thriller 0 1 \n", + "David Yates Adventure|Family|Fantasy|Mystery 0 0 \n", + "Gore Verbinski Action|Adventure|Fantasy 0 0 \n", + " Action|Adventure|Western 0 0 \n", + "Jon Favreau Adventure|Drama|Family|Fantasy 0 0 \n", + "Marc Forster Action|Adventure 0 0 \n", + "Peter Jackson Action|Adventure|Drama|Romance 0 0 \n", + " Adventure|Fantasy 0 0 \n", + "Sam Mendes Action|Adventure|Thriller 0 0 \n", + "\n", + "country New Zealand UK USA \n", + "director_name genres \n", + "Baz Luhrmann Drama|Romance 0 0 0 \n", + "Brett Ratner Action|Adventure|Fantasy|Sci-Fi|Thriller 0 0 0 \n", + "David Yates Adventure|Family|Fantasy|Mystery 0 1 0 \n", + "Gore Verbinski Action|Adventure|Fantasy 0 0 1 \n", + " Action|Adventure|Western 0 0 1 \n", + "Jon Favreau Adventure|Drama|Family|Fantasy 0 1 0 \n", + "Marc Forster Action|Adventure 0 1 0 \n", + "Peter Jackson Action|Adventure|Drama|Romance 1 0 0 \n", + " Adventure|Fantasy 1 0 1 \n", + "Sam Mendes Action|Adventure|Thriller 0 2 0 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.crosstab([df2['director_name'], df2['genres']], df2['country'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Simulate pandas crosstab with Group By" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " director_name country\n", + "director_name country \n", + "Baz Luhrmann Australia 1 1\n", + "Brett Ratner Canada 1 1\n", + "David Yates UK 1 1\n", + "Gore Verbinski USA 2 2\n", + "Jon Favreau UK 1 1\n", + "Marc Forster UK 1 1\n", + "Peter Jackson New Zealand 2 2\n", + " USA 1 1\n", + "Sam Mendes UK 2 2" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cols = ['director_name', 'country']\n", + "df2.groupby(cols)[cols].count()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas crosstab use values from another column" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA\n", + "director_name \n", + "Baz Luhrmann 7.3 NaN NaN NaN NaN\n", + "Brett Ratner NaN 6.8 NaN NaN NaN\n", + "David Yates NaN NaN NaN 7.5 NaN\n", + "Gore Verbinski NaN NaN NaN NaN 6.9\n", + "Jon Favreau NaN NaN NaN 7.8 NaN\n", + "Marc Forster NaN NaN NaN 6.7 NaN\n", + "Peter Jackson NaN NaN 7.35 NaN 7.9\n", + "Sam Mendes NaN NaN NaN 7.3 NaN" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.average)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "country Australia Canada New Zealand UK USA All\n", + "director_name \n", + "Baz Luhrmann 7.3 NaN NaN NaN NaN 7.300000\n", + "Brett Ratner NaN 6.8 NaN NaN NaN 6.800000\n", + "David Yates NaN NaN NaN 7.50 NaN 7.500000\n", + "Gore Verbinski NaN NaN NaN NaN 6.900000 6.900000\n", + "Jon Favreau NaN NaN NaN 7.80 NaN 7.800000\n", + "Marc Forster NaN NaN NaN 6.70 NaN 6.700000\n", + "Peter Jackson NaN NaN 7.35 NaN 7.900000 7.533333\n", + "Sam Mendes NaN NaN NaN 7.30 NaN 7.300000\n", + "All 7.3 6.8 7.35 7.32 7.233333 7.258333" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "pd.crosstab(df2['director_name'], df2['country'], values=df2.imdb_score, aggfunc=np.average, margins=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_How_add_new_column_existing_DataFrame.ipynb b/notebooks/pandas/Pandas_How_add_new_column_existing_DataFrame.ipynb new file mode 100644 index 0000000..aabb4ae --- /dev/null +++ b/notebooks/pandas/Pandas_How_add_new_column_existing_DataFrame.ipynb @@ -0,0 +1,1248 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas How to add new column to existing DataFrame\n", + "\n", + "* add completely new column\n", + "* add new column based on existing column\n", + "* matching the content of the DataFrame\n", + "\n", + "Bonus\n", + "* how to merge/concat DataFrame and Series\n", + "* read csv 
use converters\n", + "* join list to a DataFrame\n", + "* check dataframe for duplicated data" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def strip_space(text):\n", + " try:\n", + " return text.strip()\n", + " except AttributeError:\n", + " return text\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "0 James Cameron Avatar \n", + "1 Gore Verbinski Pirates of the Caribbean: At World's End \n", + "2 Sam Mendes Spectre \n", + "3 Christopher Nolan The Dark Knight Rises \n", + "4 Doug Walker Star Wars: Episode VII - The Force Awakens \n", + "\n", + " plot_keywords budget title_year \n", + "0 avatar|future|marine|native|paraplegic 237000000.0 2009.0 \n", + "1 goddess|marriage ceremony|marriage proposal|pi... 300000000.0 2007.0 \n", + "2 bomb|espionage|sequel|spy|terrorist 245000000.0 2015.0 \n", + "3 deception|imprisonment|lawlessness|police offi... 250000000.0 2012.0 \n", + "4 NaN NaN NaN " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset#movie_metadata.csv\n", + "\n", + "# read a dataset movies\n", + "import pandas as pd\n", + "movies = pd.read_csv('../csv/movie_metadata.csv', \n", + " usecols=['title_year', 'movie_title', 'director_name', 'plot_keywords', 'budget']\n", + " ,converters = {'movie_title' : strip_space}\n", + " )\n", + "movies.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " director_name movie_title plot_keywords \\\n", + "0 James Cameron Avatar  avatar|future|marine|native|paraplegic \n", + "\n", + " budget title_year \n", + "0 237000000.0 2009.0 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title == 'Avatar ']" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "director_name object\n", + "movie_title object\n", + "plot_keywords object\n", + "budget float64\n", + "title_year float64\n", + "dtype: object" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.dtypes" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "movies['movie_title'] = movies.movie_title.str.strip()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(5043, 5)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(247, 5)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title.duplicated(keep=False)].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "movies.drop_duplicates(subset=['movie_title'], keep='first', inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4916, 5)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## add completely new column" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_namemovie_titleplot_keywordsbudgettitle_yeare
0James CameronAvataravatar|future|marine|native|paraplegic237000000.02009.0NaN
1Gore VerbinskiPirates of the Caribbean: At World's Endgoddess|marriage ceremony|marriage proposal|pi...300000000.02007.0NaN
2Sam MendesSpectrebomb|espionage|sequel|spy|terrorist245000000.02015.0NaN
3Christopher NolanThe Dark Knight Risesdeception|imprisonment|lawlessness|police offi...250000000.02012.0NaN
4Doug WalkerStar Wars: Episode VII - The Force AwakensNaNNaNNaNNaN
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "0 James Cameron Avatar \n", + "1 Gore Verbinski Pirates of the Caribbean: At World's End \n", + "2 Sam Mendes Spectre \n", + "3 Christopher Nolan The Dark Knight Rises \n", + "4 Doug Walker Star Wars: Episode VII - The Force Awakens \n", + "\n", + " plot_keywords budget title_year \\\n", + "0 avatar|future|marine|native|paraplegic 237000000.0 2009.0 \n", + "1 goddess|marriage ceremony|marriage proposal|pi... 300000000.0 2007.0 \n", + "2 bomb|espionage|sequel|spy|terrorist 245000000.0 2015.0 \n", + "3 deception|imprisonment|lawlessness|police offi... 250000000.0 2012.0 \n", + "4 NaN NaN NaN \n", + "\n", + " e \n", + "0 NaN \n", + "1 NaN \n", + "2 NaN \n", + "3 NaN \n", + "4 NaN " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "movies['e'] = np.nan\n", + "movies.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_namemovie_titleplot_keywordsbudgettitle_yearef
0James CameronAvataravatar|future|marine|native|paraplegic237000000.02009.0NaN1
1Gore VerbinskiPirates of the Caribbean: At World's Endgoddess|marriage ceremony|marriage proposal|pi...300000000.02007.0NaN1
2Sam MendesSpectrebomb|espionage|sequel|spy|terrorist245000000.02015.0NaN1
3Christopher NolanThe Dark Knight Risesdeception|imprisonment|lawlessness|police offi...250000000.02012.0NaN1
4Doug WalkerStar Wars: Episode VII - The Force AwakensNaNNaNNaNNaN1
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "0 James Cameron Avatar \n", + "1 Gore Verbinski Pirates of the Caribbean: At World's End \n", + "2 Sam Mendes Spectre \n", + "3 Christopher Nolan The Dark Knight Rises \n", + "4 Doug Walker Star Wars: Episode VII - The Force Awakens \n", + "\n", + " plot_keywords budget title_year \\\n", + "0 avatar|future|marine|native|paraplegic 237000000.0 2009.0 \n", + "1 goddess|marriage ceremony|marriage proposal|pi... 300000000.0 2007.0 \n", + "2 bomb|espionage|sequel|spy|terrorist 245000000.0 2015.0 \n", + "3 deception|imprisonment|lawlessness|police offi... 250000000.0 2012.0 \n", + "4 NaN NaN NaN \n", + "\n", + " e f \n", + "0 NaN 1 \n", + "1 NaN 1 \n", + "2 NaN 1 \n", + "3 NaN 1 \n", + "4 NaN 1 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies['f'] = 1\n", + "movies.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## add new column based on existing column" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_namemovie_titleplot_keywordsbudgettitle_yearefcentury
5038Scott SmithSigned Sealed Deliveredfraud|postal worker|prison|theft|trialNaN2013.0NaN1True
5039NaNThe Followingcult|fbi|hideout|prison escape|serial killerNaNNaNNaN1False
5040Benjamin RoberdsA Plague So PleasantNaN1400.02013.0NaN1True
5041Daniel HsiaShanghai CallingNaNNaN2012.0NaN1True
5042Jon GunnMy Date with Drewactress name in title|crush|date|four word tit...1100.02004.0NaN1True
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "5038 Scott Smith Signed Sealed Delivered \n", + "5039 NaN The Following \n", + "5040 Benjamin Roberds A Plague So Pleasant \n", + "5041 Daniel Hsia Shanghai Calling \n", + "5042 Jon Gunn My Date with Drew \n", + "\n", + " plot_keywords budget title_year \\\n", + "5038 fraud|postal worker|prison|theft|trial NaN 2013.0 \n", + "5039 cult|fbi|hideout|prison escape|serial killer NaN NaN \n", + "5040 NaN 1400.0 2013.0 \n", + "5041 NaN NaN 2012.0 \n", + "5042 actress name in title|crush|date|four word tit... 1100.0 2004.0 \n", + "\n", + " e f century \n", + "5038 NaN 1 True \n", + "5039 NaN 1 False \n", + "5040 NaN 1 True \n", + "5041 NaN 1 True \n", + "5042 NaN 1 True " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies['century'] = movies['title_year'] > 2000\n", + "movies.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_namemovie_titleplot_keywordsbudgettitle_yearefcentury
5033Shane CarruthPrimerchanging the future|independent film|invention...7000.02004.0NaN121 Century
5034Neill Dela LlanaCavitejihad|mindanao|philippines|security guard|squa...7000.02005.0NaN121 Century
5035Robert RodriguezEl Mariachiassassin|death|guitar|gun|mariachi7000.01992.0NaN120 Century
5036Anthony ValloneThe Mongol Kingjewell|mongol|nostradamus|stepnicka|vallone3250.02005.0NaN121 Century
5037Edward BurnsNewlywedswritten and directed by cast member9000.02011.0NaN121 Century
5038Scott SmithSigned Sealed Deliveredfraud|postal worker|prison|theft|trialNaN2013.0NaN121 Century
5039NaNThe Followingcult|fbi|hideout|prison escape|serial killerNaNNaNNaN120 Century
5040Benjamin RoberdsA Plague So PleasantNaN1400.02013.0NaN121 Century
5041Daniel HsiaShanghai CallingNaNNaN2012.0NaN121 Century
5042Jon GunnMy Date with Drewactress name in title|crush|date|four word tit...1100.02004.0NaN121 Century
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "5033 Shane Carruth Primer \n", + "5034 Neill Dela Llana Cavite \n", + "5035 Robert Rodriguez El Mariachi \n", + "5036 Anthony Vallone The Mongol King \n", + "5037 Edward Burns Newlyweds \n", + "5038 Scott Smith Signed Sealed Delivered \n", + "5039 NaN The Following \n", + "5040 Benjamin Roberds A Plague So Pleasant \n", + "5041 Daniel Hsia Shanghai Calling \n", + "5042 Jon Gunn My Date with Drew \n", + "\n", + " plot_keywords budget title_year \\\n", + "5033 changing the future|independent film|invention... 7000.0 2004.0 \n", + "5034 jihad|mindanao|philippines|security guard|squa... 7000.0 2005.0 \n", + "5035 assassin|death|guitar|gun|mariachi 7000.0 1992.0 \n", + "5036 jewell|mongol|nostradamus|stepnicka|vallone 3250.0 2005.0 \n", + "5037 written and directed by cast member 9000.0 2011.0 \n", + "5038 fraud|postal worker|prison|theft|trial NaN 2013.0 \n", + "5039 cult|fbi|hideout|prison escape|serial killer NaN NaN \n", + "5040 NaN 1400.0 2013.0 \n", + "5041 NaN NaN 2012.0 \n", + "5042 actress name in title|crush|date|four word tit... 
1100.0 2004.0 \n", + "\n", + " e f century \n", + "5033 NaN 1 21 Century \n", + "5034 NaN 1 21 Century \n", + "5035 NaN 1 20 Century \n", + "5036 NaN 1 21 Century \n", + "5037 NaN 1 21 Century \n", + "5038 NaN 1 21 Century \n", + "5039 NaN 1 20 Century \n", + "5040 NaN 1 21 Century \n", + "5041 NaN 1 21 Century \n", + "5042 NaN 1 21 Century " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies['century'] = movies.century.map({True:'21 Century', False:'20 Century'})\n", + "movies.tail(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## matching the content of the DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#movies = movies.set_index('movie_title')" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Avatar True\n", + "Spectre True\n", + "Name: watched, dtype: bool" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "watched = pd.Series([True, True], index=['Avatar', 'Spectre'], name='watched')\n", + "watched" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2,)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "watched.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4916, 8)" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + 
"/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version\n", + "of pandas will change to not sort by default.\n", + "\n", + "To accept the future behavior, pass 'sort=False'.\n", + "\n", + "To retain the current behavior and silence the warning, pass 'sort=True'.\n", + "\n", + " \"\"\"Entry point for launching an IPython kernel.\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_nameplot_keywordsbudgettitle_yearefcenturywatched
#HorrorTara Subkoffbullying|cyberbullying|girl|internet|throat sl...1500000.02015.0NaN121 CenturyNaN
10 Cloverfield LaneDan Trachtenbergalien|bunker|car crash|kidnapping|minimal cast15000000.02016.0NaN121 CenturyNaN
10 Days in a MadhouseTimothy HinesNaN12000000.02015.0NaN121 CenturyNaN
10 Things I Hate About YouGil Jungerdating|protective father|school|shrew|teen movie16000000.01999.0NaN120 CenturyNaN
10,000 B.C.Christopher BarnardNaNNaNNaNNaN120 CenturyNaN
\n", + "
" + ], + "text/plain": [ + " director_name \\\n", + "#Horror Tara Subkoff \n", + "10 Cloverfield Lane Dan Trachtenberg \n", + "10 Days in a Madhouse Timothy Hines \n", + "10 Things I Hate About You Gil Junger \n", + "10,000 B.C. Christopher Barnard \n", + "\n", + " plot_keywords \\\n", + "#Horror bullying|cyberbullying|girl|internet|throat sl... \n", + "10 Cloverfield Lane alien|bunker|car crash|kidnapping|minimal cast \n", + "10 Days in a Madhouse NaN \n", + "10 Things I Hate About You dating|protective father|school|shrew|teen movie \n", + "10,000 B.C. NaN \n", + "\n", + " budget title_year e f century watched \n", + "#Horror 1500000.0 2015.0 NaN 1 21 Century NaN \n", + "10 Cloverfield Lane 15000000.0 2016.0 NaN 1 21 Century NaN \n", + "10 Days in a Madhouse 12000000.0 2015.0 NaN 1 21 Century NaN \n", + "10 Things I Hate About You 16000000.0 1999.0 NaN 1 20 Century NaN \n", + "10,000 B.C. NaN NaN NaN 1 20 Century NaN " + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_concat = pd.concat([movies.set_index('movie_title'), watched.to_frame()], axis=1)\n", + "df_concat.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True 2\n", + "Name: watched, dtype: int64" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_concat.watched.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
director_nameplot_keywordsbudgettitle_yearefcenturywatched
AvatarJames Cameronavatar|future|marine|native|paraplegic237000000.02009.0NaN121 CenturyTrue
SpectreSam Mendesbomb|espionage|sequel|spy|terrorist245000000.02015.0NaN121 CenturyTrue
\n", + "
" + ], + "text/plain": [ + " director_name plot_keywords budget \\\n", + "Avatar James Cameron avatar|future|marine|native|paraplegic 237000000.0 \n", + "Spectre Sam Mendes bomb|espionage|sequel|spy|terrorist 245000000.0 \n", + "\n", + " title_year e f century watched \n", + "Avatar 2009.0 NaN 1 21 Century True \n", + "Spectre 2015.0 NaN 1 21 Century True " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_concat[df_concat.watched == True]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_Select_rows_between_two_dates_-_DataFrame_or_CSV_file.ipynb b/notebooks/pandas/Pandas_Select_rows_between_two_dates_-_DataFrame_or_CSV_file.ipynb new file mode 100644 index 0000000..d0d81d7 --- /dev/null +++ b/notebooks/pandas/Pandas_Select_rows_between_two_dates_-_DataFrame_or_CSV_file.ipynb @@ -0,0 +1,592 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas : Select rows between two dates - DataFrame or CSV file" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Resources\n", + "\n", + "* [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)\n", + "* [pandas.DataFrame.between_time](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.between_time.html)\n", + "* 
[pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use cases\n", + "\n", + "* Pandas: Verify columns containing dates\n", + "* Convert string to datetime in DataFrame\n", + "* Select rows between two dates\n", + " * 1. Select rows based on dates with loc\n", + " * 2. Series method between\n", + " * 3. Select rows between two times\n", + " * 4. Select rows based on dates without loc\n", + " * 5. Use mask to mark the records\n", + " * 6. Select records from last month/30 days " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Import Pandas and read data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
loading_datetimepagestitledatetime_col
02019-10-28 19:56:03main<GET https://www.wikipedia.org/> (The Free En...2019-10-29 9:06:03
12019-10-29 19:56:03english<GET https://en.wikipedia.org/wiki/Main_Page>...2019-10-31 11:16:43
22019-10-29 19:56:03italiano<GET https://it.wikipedia.org/wiki/Pagina_pri...2019-10-30 21:15:23
32019-10-30 19:56:03português<GET https://pt.wikipedia.org/wiki/Wikip%C3%A...2019-10-30 20:26:35
\n", + "
" + ], + "text/plain": [ + " loading_datetime pages \\\n", + "0 2019-10-28 19:56:03 main \n", + "1 2019-10-29 19:56:03 english \n", + "2 2019-10-29 19:56:03 italiano \n", + "3 2019-10-30 19:56:03 português \n", + "\n", + " title datetime_col \n", + "0 (The Free En... 2019-10-29 9:06:03 \n", + "1 ... 2019-10-31 11:16:43 \n", + "2 \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
loading_datetimepagestitledatetime_col
12019-10-29 19:56:03english<GET https://en.wikipedia.org/wiki/Main_Page>...2019-10-31 11:16:43+00:00
22019-10-29 19:56:03italiano<GET https://it.wikipedia.org/wiki/Pagina_pri...2019-10-30 21:15:23+00:00
\n", + "" + ], + "text/plain": [ + " loading_datetime pages \\\n", + "1 2019-10-29 19:56:03 english \n", + "2 2019-10-29 19:56:03 italiano \n", + "\n", + " title datetime_col \n", + "1 ... 2019-10-31 11:16:43+00:00 \n", + "2 start_date) & (df['datetime_col'] < end_date)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. Series method between" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start_date = pd.to_datetime('2019-10-30 20:41', utc= True)\n", + "end_date = pd.to_datetime('5/13/2020 8:55', utc= True)\n", + "\n", + "df[df.datetime_col.between(start_date, end_date)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. Select rows between two times" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
loading_datetimepagestitle
datetime_col
2019-10-30 21:15:23+00:002019-10-29 19:56:03italiano<GET https://it.wikipedia.org/wiki/Pagina_pri...
\n", + "
" + ], + "text/plain": [ + " loading_datetime pages \\\n", + "datetime_col \n", + "2019-10-30 21:15:23+00:00 2019-10-29 19:56:03 italiano \n", + "\n", + " title \n", + "datetime_col \n", + "2019-10-30 21:15:23+00:00 '2018-12-02') & (df['datetime_col'] <= '2018-12-03 23:26:10+00:00')]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 6. Select records from last month/30 days " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
loading_datetimepagestitledatetime_col
12019-10-29 19:56:03english<GET https://en.wikipedia.org/wiki/Main_Page>...2019-10-31 11:16:43+00:00
\n", + "
" + ], + "text/plain": [ + " loading_datetime pages \\\n", + "1 2019-10-29 19:56:03 english \n", + "\n", + " title datetime_col \n", + "1 ... 2019-10-31 11:16:43+00:00 " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df[\"datetime_col\"] >= (pd.to_datetime('11/30/2019', utc=True) - pd.Timedelta(days=30))]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_compare_columns_in_two_Dataframes.ipynb b/notebooks/pandas/Pandas_compare_columns_in_two_Dataframes.ipynb new file mode 100644 index 0000000..b060b47 --- /dev/null +++ b/notebooks/pandas/Pandas_compare_columns_in_two_Dataframes.ipynb @@ -0,0 +1,894 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "df1 = pd.read_csv('../csv/file1.csv',sep=\"\\s+\")\n", + "df2 = pd.read_csv('../csv/file2.csv',sep=\"\\s+\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
nametypevalue
0Mikea+98
1Jerya-144
2Tomyb108
\n", + "
" + ], + "text/plain": [ + " name type value\n", + "0 Mike a+ 98\n", + "1 Jery a- 144\n", + "2 Tomy b 108" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
typelowhigh
0a+7897
1a-108143
2b108150
\n", + "
" + ], + "text/plain": [ + " type low high\n", + "0 a+ 78 97\n", + "1 a- 108 143\n", + "2 b 108 150" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Similar sized dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "df1['low_value'] = np.where(df1.type == df2.type, 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 True\n", + "1 True\n", + "2 True\n", + "Name: low_value, dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['low_value']" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# compare using np.where whether values from first dataframe has match in the column of the second\n", + "import numpy as np\n", + "df1['low_high'] = np.where(df1.value < df2.high, 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 True\n", + "Name: low_high, dtype: object" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['low_high']" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Compare one column from first against two from second dataframe\n", + "df1['low_high_value'] = np.where((df1.value >= df2.low) & (df1.value <= df2.high), 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 True\n", + "Name: low_high_value, dtype: 
object" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['low_high_value']" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['False', 'False', 'True'], dtype='\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
nametypevalue0
0Mikea+98False
1Jerya-144False
2Tomyb108True
\n", + "" + ], + "text/plain": [ + " name type value 0\n", + "0 Mike a+ 98 False\n", + "1 Jery a- 144 False\n", + "2 Tomy b 108 True" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# compare data as a Boolean Series and join the result to the first dataframe\n", + "df3 = [(df2.type.isin(df1.type)) & (df1.value.between(df2.low,df2.high))]\n", + "df1.join(df3)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# compare data and assign it as a new column to the first dataframe\n", + "df1['enh1'] = pd.Series((df2.type.isin(df1.type)) & (df1.value >= df2.low) & (df1.value <= df2.high))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
nametypevalueenh1
0Mikea+98False
1Jerya-144False
2Tomyb108True
\n", + "
" + ], + "text/plain": [ + " name type value enh1\n", + "0 Mike a+ 98 False\n", + "1 Jery a- 144 False\n", + "2 Tomy b 108 True" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "# compare with 3 conditions and or clause. You can use any valid python code\n", + "df1['enh2'] = pd.Series((df2.type.isin(df1.type)) & (df1.value != df2.low) | (df1.value + 1 == df2.high))" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
nametypevalueenh1enh2
0Mikea+98FalseTrue
1Jerya-144FalseTrue
2Tomyb108TrueFalse
\n", + "
" + ], + "text/plain": [ + " name type value enh1 enh2\n", + "0 Mike a+ 98 False True\n", + "1 Jery a- 144 False True\n", + "2 Tomy b 108 True False" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Different sized dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# add new row for dataframe 2\n", + "df2 = pd.concat([df2, pd.DataFrame([{'type':'0', 'low':143, 'high':108}])], ignore_index=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "merged = df1.merge(df2,how='outer',left_on=['type'],right_on=[\"type\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
nametypevalueenh1enh2lowhigh
0Mikea+98.0FalseTrue7897
1Jerya-144.0FalseTrue108143
2Tomyb108.0TrueFalse108150
3NaN0NaNNaNNaN143108
\n", + "
" + ], + "text/plain": [ + " name type value enh1 enh2 low high\n", + "0 Mike a+ 98.0 False True 78 97\n", + "1 Jery a- 144.0 False True 108 143\n", + "2 Tomy b 108.0 True False 108 150\n", + "3 NaN 0 NaN NaN NaN 143 108" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " name type value enh1 enh2 low high\n", + "2 Tomy b 108.0 True False 108 150" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "merged[(merged.value >= merged.low) & (merged.value <= merged.high)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Error ValueError: Can only compare identically-labeled Series objects" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Can only compare identically-labeled Series objects", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# demo of error - ValueError: Can only compare identically-labeled Series objects\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'low_high'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwhere\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mdf2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhigh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'True'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'False'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/ops/__init__.py\u001b[0m in 
\u001b[0;36mwrapper\u001b[0;34m(self, other, axis)\u001b[0m\n\u001b[1;32m 1140\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1141\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mABCSeries\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_indexed_same\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1142\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Can only compare identically-labeled \"\u001b[0m \u001b[0;34m\"Series objects\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1143\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1144\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mis_categorical_dtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mValueError\u001b[0m: Can only compare identically-labeled Series objects" + ] + } + ], + "source": [ + "# demo of error - ValueError: Can only compare identically-labeled Series objects \n", + "import numpy as np\n", + "df1['low_high'] = np.where(df1.value < df2.high, 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "df2.drop(3, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "# demo of error - Now is working because of equal rows\n", + "import numpy as np\n", + "df1['low_high'] = np.where(df1.value < df2.high, 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "# 
how to cause it on first dataframes\n", + "df1.set_index([pd.Index([1, 2, 3])], inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "ename": "ValueError", + "evalue": "Can only compare identically-labeled Series objects", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# demo of error - ValueError: Can only compare identically-labeled Series objects because of mismatching indexes\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mnumpy\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mdf1\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'low_high'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwhere\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mdf2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhigh\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'True'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'False'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m/home/vanx/Software/Tensorflow/environments/venv36/lib/python3.6/site-packages/pandas/core/ops/__init__.py\u001b[0m in \u001b[0;36mwrapper\u001b[0;34m(self, other, axis)\u001b[0m\n\u001b[1;32m 1140\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1141\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mABCSeries\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_indexed_same\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1142\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Can only compare identically-labeled \"\u001b[0m \u001b[0;34m\"Series objects\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1143\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1144\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mis_categorical_dtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mValueError\u001b[0m: Can only compare identically-labeled Series objects" + ] + } + ], + "source": [ + "# demo of error - ValueError: Can only compare identically-labeled Series objects because of mismatching indexes\n", + "import numpy as np\n", + "df1['low_high'] = np.where(df1.value < df2.high, 'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "# possible solution for - ValueError: Can only compare identically-labeled Series objects\n", + "df1.sort_index(inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "# possible solution for - ValueError: Can only compare identically-labeled Series objects\n", + "df1.reset_index(inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "# demo of error - ValueError: Can only compare identically-labeled Series objects\n", + "import numpy as np\n", + "df1['low_high'] = np.where(df1.value < df2.high, 
'True', 'False')" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 True\n", + "Name: low_high, dtype: object" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df1['low_high']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_count_values_in_a_column_of_type_list.ipynb b/notebooks/pandas/Pandas_count_values_in_a_column_of_type_list.ipynb new file mode 100644 index 0000000..42f168a --- /dev/null +++ b/notebooks/pandas/Pandas_count_values_in_a_column_of_type_list.ipynb @@ -0,0 +1,2090 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pandas count values in a column of type list?\n", + "\n", + "Data set: Stack Overflow Developer Survey 2018\n", + "\n", + "* https://insights.stackoverflow.com/survey\n", + "* https://insights.stackoverflow.com/survey/2018#technology\n", + "\n", + "Topics\n", + "\n", + "* expand list column\n", + "* value_counts for list column\n", + "\n", + "Bonus\n", + "\n", + "* combine head and tail\n", + "* slicing with iloc and ranges\n", + "* value_counts on all columns\n", + "* sum per column\n", + "* sum several columns\n", + "* sum all columns with iteration\n", + "* be careful when you chain operations with pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + 
"import pandas as pd\n", + "import numpy as np\n", + "pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(98855, 129)\n" + ] + } + ], + "source": [ + "# read the data frame and see the data insight\n", + "df = pd.read_csv(\"../csv/stackoverflow/developer_survey_2018/survey_results_public.csv\", low_memory=False)\n", + "print(df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student Employment \\\n", + "0 1 Yes No Kenya No Employed part-time \n", + "1 3 Yes Yes United Kingdom No Employed full-time \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "1 ... Daily or almost every day Male Straight or heterosexual \n", + "\n", + " EducationParents RaceEthnicity \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) White or of European descent \n", + "\n", + " Age Dependents MilitaryUS \\\n", + "0 25 - 34 years old Yes NaN \n", + "1 35 - 44 years old Yes NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy \n", + "\n", + "[2 rows x 129 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student \\\n", + "0 1 Yes No Kenya No \n", + "1 3 Yes Yes United Kingdom No \n", + "98853 101544 Yes No Russian Federation No \n", + "98854 101548 Yes Yes Cambodia NaN \n", + "\n", + " Employment \\\n", + "0 Employed part-time \n", + "1 Employed full-time \n", + "98853 Independent contractor, freelancer, or self-employed \n", + "98854 NaN \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "98853 Some college/university study without earning a degree \n", + "98854 NaN \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "1 ... Daily or almost every day Male Straight or heterosexual \n", + "98853 ... NaN NaN NaN \n", + "98854 ... NaN NaN NaN \n", + "\n", + " EducationParents RaceEthnicity \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) 
White or of European descent \n", + "98853 NaN NaN \n", + "98854 NaN NaN \n", + "\n", + " Age Dependents MilitaryUS \\\n", + "0 25 - 34 years old Yes NaN \n", + "1 35 - 44 years old Yes NaN \n", + "98853 NaN NaN NaN \n", + "98854 NaN NaN NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy \n", + "98853 NaN NaN \n", + "98854 NaN NaN \n", + "\n", + "[4 rows x 129 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# combine head and tail variant 1\n", + "rows = 2\n", + "df.head(rows).append(df.tail(rows))" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student \\\n", + "0 1 Yes No Kenya No \n", + "1 3 Yes Yes United Kingdom No \n", + "98853 101544 Yes No Russian Federation No \n", + "98854 101548 Yes Yes Cambodia NaN \n", + "\n", + " Employment \\\n", + "0 Employed part-time \n", + "1 Employed full-time \n", + "98853 Independent contractor, freelancer, or self-employed \n", + "98854 NaN \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "98853 Some college/university study without earning a degree \n", + "98854 NaN \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "98853 NaN \n", + "98854 NaN \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "1 ... Daily or almost every day Male Straight or heterosexual \n", + "98853 ... NaN NaN NaN \n", + "98854 ... NaN NaN NaN \n", + "\n", + " EducationParents RaceEthnicity \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) 
White or of European descent \n", + "98853 NaN NaN \n", + "98854 NaN NaN \n", + "\n", + " Age Dependents MilitaryUS \\\n", + "0 25 - 34 years old Yes NaN \n", + "1 35 - 44 years old Yes NaN \n", + "98853 NaN NaN NaN \n", + "98854 NaN NaN NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy \n", + "98853 NaN NaN \n", + "98854 NaN NaN \n", + "\n", + "[4 rows x 129 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# combine head and tail variant 2\n", + "# ranges with iloc\n", + "rows = 2\n", + "df.iloc[np.r_[:rows, -rows:0]]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 JavaScript;Python;HTML;CSS\n", + "1 JavaScript;Python;Bash/Shell\n", + "2 NaN\n", + "3 C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell\n", + "4 C;C++;Java;Matlab;R;SQL;Bash/Shell\n", + "98850 NaN\n", + "98851 NaN\n", + "98852 NaN\n", + "98853 NaN\n", + "98854 NaN\n", + "Name: LanguageWorkedWith, dtype: object" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get examples from column LanguageWorkedWith\n", + "rows = 5\n", + "df.LanguageWorkedWith.iloc[np.r_[:rows, -rows:0]]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "C#;JavaScript;SQL;HTML;CSS 1347\n", + "JavaScript;PHP;SQL;HTML;CSS 1235\n", + "Java 1030\n", + "JavaScript;HTML;CSS 881\n", + "C#;JavaScript;SQL;TypeScript;HTML;CSS 828\n", + "C;C++;C#;Java;Python;SQL;Swift;HTML;CSS;Bash/Shell 1\n", + "C;C#;Java;JavaScript;PHP;Python;SQL;VBA;VB.NET;HTML;CSS;Bash/Shell 1\n", + "C#;Objective-C;PHP;Python;Swift;HTML;CSS;Bash/Shell 1\n", + "C#;Java;JavaScript;Objective-C;Perl;PHP;Python;SQL;Swift;TypeScript;VBA;VB.NET;HTML;CSS;Bash/Shell 1\n", + 
"C#;CoffeeScript;F#;JavaScript;SQL;TypeScript;HTML;CSS 1\n", + "Name: LanguageWorkedWith, dtype: int64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# value counts for the same column\n", + "df.LanguageWorkedWith.value_counts().iloc[np.r_[:rows, -rows:0]]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(98855, 38)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# expand the column on separator\n", + "df_lang = df.LanguageWorkedWith.str.split(';', expand=True)\n", + "df_lang.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(78334, 38)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_lang = df_lang.dropna(how='all')\n", + "df_lang.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 6 \\\n", + "0 JavaScript Python HTML CSS None None None \n", + "1 JavaScript Python Bash/Shell None None None None \n", + "3 C# JavaScript SQL TypeScript HTML CSS Bash/Shell \n", + "4 C C++ Java Matlab R SQL Bash/Shell \n", + "5 Java JavaScript Python TypeScript HTML CSS None \n", + "\n", + " 7 8 9 ... 28 29 30 31 32 33 34 35 \\\n", + "0 None None None ... None None None None None None None None \n", + "1 None None None ... None None None None None None None None \n", + "3 None None None ... None None None None None None None None \n", + "4 None None None ... None None None None None None None None \n", + "5 None None None ... None None None None None None None None \n", + "\n", + " 36 37 \n", + "0 None None \n", + "1 None None \n", + "3 None None \n", + "4 None None \n", + "5 None None \n", + "\n", + "[5 rows x 38 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_lang.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 6 7 \\\n", + "Assembly 5760.0 NaN NaN NaN NaN NaN NaN NaN \n", + "Bash/Shell 29.0 465.0 1221.0 1929.0 2882.0 4442.0 4844.0 4269.0 \n", + "C 13335.0 4707.0 NaN NaN NaN NaN NaN NaN \n", + "C# 16969.0 4321.0 3990.0 1674.0 NaN NaN NaN NaN \n", + "C++ 7042.0 9275.0 3555.0 NaN NaN NaN NaN NaN \n", + "\n", + " 8 9 ... 28 29 30 31 32 33 34 35 36 \\\n", + "Assembly NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", + "Bash/Shell 3311.0 2562.0 ... 3.0 1.0 2.0 2.0 NaN 1.0 NaN NaN 2.0 \n", + "C NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", + "C# NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", + "C++ NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", + "\n", + " 37 \n", + "Assembly NaN \n", + "Bash/Shell 35.0 \n", + "C NaN \n", + "C# NaN \n", + "C++ NaN \n", + "\n", + "[5 rows x 38 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get languages as count / numbers\n", + "# how to use value counts for the whole dataframe\n", + "df_lang_num = df_lang.fillna(0).apply(pd.Series.value_counts)\n", + "df_lang_num.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
            0         1         2         3         4         5         6         7         8         9         ...  28        29        30        31        32        33        34        35        36        37
Assembly    0.073531  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       ...  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
Bash/Shell  0.000370  0.005936  0.015587  0.024625  0.036791  0.056706  0.061838  0.054497  0.042268  0.032706  ...  0.000038  0.000013  0.000026  0.000026  NaN       0.000013  NaN       NaN       0.000026  0.000447
C           0.170233  0.060089  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       ...  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
C#          0.216624  0.055161  0.050936  0.021370  NaN       NaN       NaN       NaN       NaN       NaN       ...  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
C++         0.089897  0.118403  0.045383  NaN       NaN       NaN       NaN       NaN       NaN       NaN       ...  NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
\n", + "

5 rows × 38 columns

\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 \\\n", + "Assembly 0.073531 NaN NaN NaN NaN NaN \n", + "Bash/Shell 0.000370 0.005936 0.015587 0.024625 0.036791 0.056706 \n", + "C 0.170233 0.060089 NaN NaN NaN NaN \n", + "C# 0.216624 0.055161 0.050936 0.021370 NaN NaN \n", + "C++ 0.089897 0.118403 0.045383 NaN NaN NaN \n", + "\n", + " 6 7 8 9 ... 28 29 \\\n", + "Assembly NaN NaN NaN NaN ... NaN NaN \n", + "Bash/Shell 0.061838 0.054497 0.042268 0.032706 ... 0.000038 0.000013 \n", + "C NaN NaN NaN NaN ... NaN NaN \n", + "C# NaN NaN NaN NaN ... NaN NaN \n", + "C++ NaN NaN NaN NaN ... NaN NaN \n", + "\n", + " 30 31 32 33 34 35 36 37 \n", + "Assembly NaN NaN NaN NaN NaN NaN NaN NaN \n", + "Bash/Shell 0.000026 0.000026 NaN 0.000013 NaN NaN 0.000026 0.000447 \n", + "C NaN NaN NaN NaN NaN NaN NaN NaN \n", + "C# NaN NaN NaN NaN NaN NaN NaN NaN \n", + "C++ NaN NaN NaN NaN NaN NaN NaN NaN \n", + "\n", + "[5 rows x 38 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get languages as percentage / ratio\n", + "# value counts, parameters and lambda\n", + "df_lang_per = df_lang.fillna(0).apply(lambda x: pd.value_counts(x, normalize=True))\n", + "df_lang_per.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "ename": "TypeError", + "evalue": "value_counts() missing 1 required positional argument: 'self'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# why for value counts and parameters you need lambda\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf_lang_per\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mdf_lang\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfillna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue_counts\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnormalize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;31mTypeError\u001b[0m: value_counts() missing 1 required positional argument: 'self'" + ] + } + ], + "source": [ + "# why for value counts and parameters you need lambda\n", + "df_lang_per = df_lang.fillna(0).apply(pd.Series.value_counts(normalize=True))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 31.800036\n", + "JavaScript 0.698113\n", + "HTML 0.684607\n", + "CSS 0.650790\n", + "SQL 0.570250\n", + "Java 0.453456\n", + "Bash/Shell 0.397937\n", + "Python 0.387558\n", + "C# 0.344091\n", + "PHP 0.307287\n", + "Name: total, dtype: float64" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# getting the percentage of use for each language\n", + "df_lang_per['total'] = df_lang_per.sum(axis=1)\n", + "df_lang_per.sort_values('total', ascending=False)['total'].head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2491024.0\n", + "JavaScript 54686.0\n", + "HTML 53628.0\n", + "CSS 50979.0\n", + "SQL 44670.0\n", + "Name: total, dtype: float64" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# getting the number of use for each language\n", + "df_lang_num['total'] = df_lang_num.sum(axis=1)\n", + "df_lang_num.sort_values('total', 
ascending=False)['total'].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   0           1           2           3           4           5           6           7           8           9           ...  28    29    30    31    32    33    34    35    36    37
0  JavaScript  Python      HTML        CSS         None        None        None        None        None        None        ...  None  None  None  None  None  None  None  None  None  None
1  JavaScript  Python      Bash/Shell  None        None        None        None        None        None        None        ...  None  None  None  None  None  None  None  None  None  None
3  C#          JavaScript  SQL         TypeScript  HTML        CSS         Bash/Shell  None        None        None        ...  None  None  None  None  None  None  None  None  None  None
4  C           C++         Java        Matlab      R           SQL         Bash/Shell  None        None        None        ...  None  None  None  None  None  None  None  None  None  None
5  Java        JavaScript  Python      TypeScript  HTML        CSS         None        None        None        None        ...  None  None  None  None  None  None  None  None  None  None
\n", + "

5 rows × 38 columns

\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 6 \\\n", + "0 JavaScript Python HTML CSS None None None \n", + "1 JavaScript Python Bash/Shell None None None None \n", + "3 C# JavaScript SQL TypeScript HTML CSS Bash/Shell \n", + "4 C C++ Java Matlab R SQL Bash/Shell \n", + "5 Java JavaScript Python TypeScript HTML CSS None \n", + "\n", + " 7 8 9 ... 28 29 30 31 32 33 34 35 \\\n", + "0 None None None ... None None None None None None None None \n", + "1 None None None ... None None None None None None None None \n", + "3 None None None ... None None None None None None None None \n", + "4 None None None ... None None None None None None None None \n", + "5 None None None ... None None None None None None None None \n", + "\n", + " 36 37 \n", + "0 None None \n", + "1 None None \n", + "3 None None \n", + "4 None None \n", + "5 None None \n", + "\n", + "[5 rows x 38 columns]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_lang.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "C# 16969\n", + "C 13335\n", + "JavaScript 12150\n", + "Java 12087\n", + "C++ 7042\n", + "Name: 0, dtype: int64" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get value counts for first column\n", + "df_lang[0].value_counts().head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "JavaScript 19532\n", + "Java 10175\n", + "C++ 9275\n", + "PHP 6450\n", + "C 4707\n", + "Name: 1, dtype: int64" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get value counts for second column\n", + "df_lang[1].value_counts().head()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + 
"JavaScript 48938.0\n", + "Java 32991.0\n", + "C# 26954.0\n", + "SQL 24727.0\n", + "Python 19063.0\n", + "dtype: float64" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# do a sum of several columns\n", + "df_comb_col = df_lang[0].value_counts(dropna=False) + df_lang[1].value_counts(dropna=False) + df_lang[2].value_counts(dropna=False)+ df_lang[3].value_counts(dropna=False)\n", + "df_comb_col.sort_values(ascending=False).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "df_comb = pd.DataFrame()\n", + "lang_index = []\n", + "df_lang.fillna(0, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
            total
JavaScript  54686
HTML        53628
CSS         50979
SQL         44670
Java        35521
Rust        1857
Kotlin      3508
Cobol       590
Ocaml       470
CSS         50979
\n", + "
" + ], + "text/plain": [ + " total\n", + "JavaScript 54686\n", + "HTML 53628\n", + "CSS 50979\n", + "SQL 44670\n", + "Java 35521\n", + "Rust 1857\n", + "Kotlin 3508\n", + "Cobol 590\n", + "Ocaml 470\n", + "CSS 50979" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sum all columns in dataframe with iteration\n", + "for col in df_lang.columns:\n", + " if col == 0:\n", + " df_comb['total'] = df_lang[col].fillna(0).value_counts()\n", + " lang_index = df_lang[col].value_counts().index\n", + " else:\n", + " col_ser = df_lang[col].fillna(0).value_counts()\n", + " col_ser = col_ser.reindex(lang_index, fill_value=0)\n", + " df_comb['total'] = df_comb['total'] + col_ser\n", + "df_comb.sort_values('total', ascending=False).head(rows).append(df_comb.tail(rows))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
            total
JavaScript  54686
HTML        53628
CSS         50979
SQL         44670
Java        35521
Erlang      886
Cobol       590
Ocaml       470
Julia       430
Hack        254
\n", + "
" + ], + "text/plain": [ + " total\n", + "JavaScript 54686\n", + "HTML 53628\n", + "CSS 50979\n", + "SQL 44670\n", + "Java 35521\n", + "Erlang 886\n", + "Cobol 590\n", + "Ocaml 470\n", + "Julia 430\n", + "Hack 254" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_comb = df_comb.sort_values('total', ascending=False)\n", + "df_comb.head(rows).append(df_comb.tail(rows))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note**: In some cases the iteration example is not working properly - when the first column doesn't contain all values. It can be replaced with the example below:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "JavaScript 54686.0\n", + "HTML 53628.0\n", + "CSS 50979.0\n", + "SQL 44670.0\n", + "Java 35521.0\n", + "Bash/Shell 31172.0\n", + "Python 30359.0\n", + "C# 26954.0\n", + "PHP 24071.0\n", + "C++ 19872.0\n", + "Delphi/Object Pascal 2025.0\n", + "Haskell 1961.0\n", + "Rust 1857.0\n", + "F# 1115.0\n", + "Clojure 1032.0\n", + "Erlang 886.0\n", + "Cobol 590.0\n", + "Ocaml 470.0\n", + "Julia 430.0\n", + "Hack 254.0\n", + "dtype: float64" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_comb = pd.DataFrame()\n", + "temp = []\n", + "val_count_tmp = pd.Series(dtype=float)\n", + "\n", + "# sum all columns in dataframe with iteration\n", + "for col in df_lang.columns:\n", + " temp.append(df_lang[col].fillna(0).value_counts())\n", + "\n", + "for val_count in temp:\n", + " val_count_tmp = val_count_tmp.add(val_count,fill_value=0)\n", + "\n", + "y = val_count_tmp.dropna().drop(0) \n", + "y.sort_values(ascending=False, inplace=True)\n", + "y.head(10).append(y.tail(10))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { 
+ "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Pandas_extract_url_or_dates_from_column.ipynb b/notebooks/pandas/Pandas_extract_url_or_dates_from_column.ipynb new file mode 100644 index 0000000..fe659f2 --- /dev/null +++ b/notebooks/pandas/Pandas_extract_url_or_dates_from_column.ipynb @@ -0,0 +1,786 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Pandas extract URL or date by regex" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Reading the CSV file as it is\n", + "result = pd.read_csv('../csv/url_dates.csv') " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('display.max_columns', None) # or 1000\n", + "pd.set_option('display.max_rows', None) # or 1000\n", + "pd.set_option('display.max_colwidth', -1) # or 199" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   log
0  2019-10-28 19:56:03 DEMO &lt;GET https://www.wikipedia.org/&gt; (The Free Encyclopedia) 2019-10-29 9:06:03
1  2019-10-29 19:56:03 DEMO &lt;GET https://en.wikipedia.org/wiki/Main_Page&gt; (5,962,233 articles in English) 2019-10-31 11:16:43
2  2019-10-29 19:56:03 DEMO &lt;GET https://it.wikipedia.org/wiki/Pagina_principale&gt; (1 561 730 voci in italiano) 2019-10-30 21:15:23
3  2019-10-30 19:56:03 DEMO &lt;GET https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal&gt; (1 014 783 artigos em português) 2019-10-30 20:26:35
\n", + "
" + ], + "text/plain": [ + " log\n", + "0 2019-10-28 19:56:03 DEMO (The Free Encyclopedia) 2019-10-29 9:06:03 \n", + "1 2019-10-29 19:56:03 DEMO (5,962,233 articles in English) 2019-10-31 11:16:43 \n", + "2 2019-10-29 19:56:03 DEMO (1 561 730 voci in italiano) 2019-10-30 21:15:23 \n", + "3 2019-10-30 19:56:03 DEMO (1 014 783 artigos em português) 2019-10-30 20:26:35" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Checking sample data\n", + "result.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# URL extraction from Dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# extract urls by matching protocol - https and end >\n", + "# first part is a matching group while the ending is a non matching group\n", + "result['url'] = result.log.str.extract(r'(https.*)(?:>)').head()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   log                                                                                                                                 url
2  2019-10-29 19:56:03 DEMO &lt;GET https://it.wikipedia.org/wiki/Pagina_principale&gt; (1 561 730 voci in italiano) 2019-10-30 21:15:23  https://it.wikipedia.org/wiki/Pagina_principale
\n", + "
" + ], + "text/plain": [ + " log \\\n", + "2 2019-10-29 19:56:03 DEMO (1 561 730 voci in italiano) 2019-10-30 21:15:23 \n", + "\n", + " url \n", + "2 https://it.wikipedia.org/wiki/Pagina_principale " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# filtering results if needed\n", + "result[result['url'].str.contains('it.wikipedia.org')]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# extract urls by matching protocol - https and end >\n", + "# first part is a matching group while the ending is a non matching group\n", + "result['url'] = result.log.str.extract(r'(https.*)(?:>)').head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   0
0  https://www.wikipedia.org/&gt;
1  https://en.wikipedia.org/wiki/Main_Page&gt;
2  https://it.wikipedia.org/wiki/Pagina_principale&gt;
3  https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal&gt;
\n", + "
" + ], + "text/plain": [ + " 0\n", + "0 https://www.wikipedia.org/> \n", + "1 https://en.wikipedia.org/wiki/Main_Page> \n", + "2 https://it.wikipedia.org/wiki/Pagina_principale> \n", + "3 https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal>" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# examples\n", + "\n", + "result.log.str.extract(r'(https?:\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|www\\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|https?:\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9]+\\.[^\\s]{2,}|www\\.[a-zA-Z0-9]+\\.[^\\s]{2,})').head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   0      1    2                                                                 3    4    5
0  https  NaN  www.wikipedia.org/&gt;                                            NaN  NaN  NaN
1  https  NaN  en.wikipedia.org/wiki/Main_Page&gt;                               NaN  NaN  NaN
2  https  NaN  it.wikipedia.org/wiki/Pagina_principale&gt;                       NaN  NaN  NaN
3  https  NaN  pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal&gt;    NaN  NaN  NaN
\n", + "
" + ], + "text/plain": [ + " 0 1 2 \\\n", + "0 https NaN www.wikipedia.org/> \n", + "1 https NaN en.wikipedia.org/wiki/Main_Page> \n", + "2 https NaN it.wikipedia.org/wiki/Pagina_principale> \n", + "3 https NaN pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal> \n", + "\n", + " 3 4 5 \n", + "0 NaN NaN NaN \n", + "1 NaN NaN NaN \n", + "2 NaN NaN NaN \n", + "3 NaN NaN NaN " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# examples\n", + "result.log.str.extract(r'(ftp|http|https):\\/\\/(\\w+:{0,1}\\w*@)?(\\S+)(:[0-9]+)?(\\/|\\/([\\w#!:.?+=&%@!\\-\\/]))?').head()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Date extraction from Dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# extract single date\n", + "result['date'] = result.log.str.extract(r'(\\d{4}-\\d{2}-\\d{2})')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2019-10-28\n", + "1 2019-10-29\n", + "2 2019-10-29\n", + "3 2019-10-30\n", + "Name: date, dtype: object" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['date']" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
          0
   match
0  0      2019-10-28
   1      2019-10-29
1  0      2019-10-29
   1      2019-10-31
2  0      2019-10-29
   1      2019-10-30
3  0      2019-10-30
   1      2019-10-30
\n", + "
" + ], + "text/plain": [ + " 0\n", + " match \n", + "0 0 2019-10-28\n", + " 1 2019-10-29\n", + "1 0 2019-10-29\n", + " 1 2019-10-31\n", + "2 0 2019-10-29\n", + " 1 2019-10-30\n", + "3 0 2019-10-30\n", + " 1 2019-10-30" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# extract multiple dates\n", + "result.log.str.extractall(r'(\\d{4}-\\d{2}-\\d{2})')" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       0
match  0           1
0      2019-10-28  2019-10-29
1      2019-10-29  2019-10-31
2      2019-10-29  2019-10-30
3      2019-10-30  2019-10-30
\n", + "
" + ], + "text/plain": [ + " 0 \n", + "match 0 1\n", + "0 2019-10-28 2019-10-29\n", + "1 2019-10-29 2019-10-31\n", + "2 2019-10-29 2019-10-30\n", + "3 2019-10-30 2019-10-30" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# unstack the multiindex\n", + "result.log.str.extractall(r'(\\d{4}-\\d{2}-\\d{2})').unstack()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# extract datetime\n", + "result['datetime'] = result.log.str.extract(r'(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})')" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 2019-10-28 19:56:03\n", + "1 2019-10-29 19:56:03\n", + "2 2019-10-29 19:56:03\n", + "3 2019-10-30 19:56:03\n", + "Name: datetime, dtype: object" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['datetime']" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "# match datetime extract only date\n", + "result['date'] = result.log.str.extract(r'(\\d{4}-\\d{2}-\\d{2}) (?:\\d{2}-\\d{2}-\\d{2})')" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 NaN\n", + "1 NaN\n", + "2 NaN\n", + "3 NaN\n", + "Name: date, dtype: object" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['date']" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# match datetime extract only date\n", + "result[['date', 'time']] = result.log.str.extract(r'(\\d{4}-\\d{2}-\\d{2}) (\\d{2}:\\d{2}:\\d{2})')" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   date        time
0  2019-10-28  19:56:03
1  2019-10-29  19:56:03
2  2019-10-29  19:56:03
3  2019-10-30  19:56:03
\n", + "
" + ], + "text/plain": [ + " date time\n", + "0 2019-10-28 19:56:03\n", + "1 2019-10-29 19:56:03\n", + "2 2019-10-29 19:56:03\n", + "3 2019-10-30 19:56:03" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result[['date', 'time']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Split URLs" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "result['url_split'] = 'https' + result.log.str.split('https', expand=True)[1].str.split('>', expand=True)[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 https://www.wikipedia.org/ \n", + "1 https://en.wikipedia.org/wiki/Main_Page \n", + "2 https://it.wikipedia.org/wiki/Pagina_principale \n", + "3 https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal\n", + "Name: url_split, dtype: object" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result['url_split']" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/Python_Pandas_find_and_drop_duplicate_data.ipynb b/notebooks/pandas/Python_Pandas_find_and_drop_duplicate_data.ipynb new file mode 100644 index 0000000..9e9e9be --- /dev/null +++ b/notebooks/pandas/Python_Pandas_find_and_drop_duplicate_data.ipynb @@ -0,0 +1,1603 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python Pandas identify and drop duplicate data\n", + "\n", + "* 
identify duplicate rows in Pandas\n", + "* find duplicate values in a column\n", + "* identify duplicate values in several columns\n", + "* drop duplicated data in all columns\n", + "* drop duplicated data in several column\n", + "\n", + "Bonus\n", + "\n", + "* find duplicates in index\n", + "* find duplicate data in a row\n", + "* delete columns with duplicates" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   director_name      movie_title                                 plot_keywords                                      budget       title_year
0  James Cameron      Avatar                                      avatar|future|marine|native|paraplegic             237000000.0  2009.0
1  Gore Verbinski     Pirates of the Caribbean: At World's End    goddess|marriage ceremony|marriage proposal|pi...  300000000.0  2007.0
2  Sam Mendes         Spectre                                     bomb|espionage|sequel|spy|terrorist                245000000.0  2015.0
3  Christopher Nolan  The Dark Knight Rises                       deception|imprisonment|lawlessness|police offi...  250000000.0  2012.0
4  Doug Walker        Star Wars: Episode VII - The Force Awakens  NaN                                                NaN          NaN
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "0 James Cameron Avatar \n", + "1 Gore Verbinski Pirates of the Caribbean: At World's End \n", + "2 Sam Mendes Spectre \n", + "3 Christopher Nolan The Dark Knight Rises \n", + "4 Doug Walker Star Wars: Episode VII - The Force Awakens \n", + "\n", + " plot_keywords budget title_year \n", + "0 avatar|future|marine|native|paraplegic 237000000.0 2009.0 \n", + "1 goddess|marriage ceremony|marriage proposal|pi... 300000000.0 2007.0 \n", + "2 bomb|espionage|sequel|spy|terrorist 245000000.0 2015.0 \n", + "3 deception|imprisonment|lawlessness|police offi... 250000000.0 2012.0 \n", + "4 NaN NaN NaN " + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset#movie_metadata.csv\n", + "\n", + "# read a dataset movies\n", + "import pandas as pd\n", + "movies = pd.read_csv('../csv/movie_metadata.csv', \n", + " usecols=['title_year', 'movie_title', 'director_name', 'plot_keywords', 'budget']\n", + " )\n", + "movies['movie_title'] = movies.movie_title.str.strip()\n", + "movies.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## find duplicate rows in Pandas\n", + "\n", + "**subset** : column label or sequence of labels, optional\n", + "Only consider certain columns for identifying duplicates, by default use all of the columns\n", + "\n", + "**keep** : {‘first’, ‘last’, False}, default ‘first’\n", + "* first : Mark duplicates as True except for the first occurrence.\n", + "* last : Mark duplicates as True except for the last occurrence.\n", + "* False : Mark all duplicates as True." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(5043, 5)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(123, 5)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.duplicated(keep='first')].shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## find duplicate values in a column" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(127, 5)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title.duplicated(keep='first')].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "137 David Yates The Legend of Tarzan \n", + "187 Bill Condon The Twilight Saga: Breaking Dawn - Part 2 \n", + "204 Hideaki Anno Godzilla Resurgence \n", + "303 Joe Wright Pan \n", + "389 Josh Trank Fantastic Four \n", + "\n", + " plot_keywords budget \\\n", + "137 africa|capture|jungle|male objectification|tarzan 180000000.0 \n", + "187 battle|friend|super strength|vampire|vision 120000000.0 \n", + "204 blood|godzilla|monster|sequel NaN \n", + "303 1940s|child hero|fantasy world|orphan|referenc... 150000000.0 \n", + "389 box office flop|critically bashed|portal|telep... 120000000.0 \n", + "\n", + " title_year \n", + "137 2016.0 \n", + "187 2012.0 \n", + "204 2016.0 \n", + "303 2015.0 \n", + "389 2015.0 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title.duplicated(keep='first')].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(127, 5)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title.duplicated(keep='last')].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(247, 5)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title.duplicated(keep=False)].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "367 Timur Bekmambetov Ben-Hur \n", + "2613 Timur Bekmambetov Ben-Hur \n", + "3967 Timur Bekmambetov Ben-Hur \n", + "\n", + " plot_keywords budget \\\n", + "367 NaN NaN \n", + "2613 chariot race|epic|false accusation|jerusalem|s... 100000000.0 \n", + "3967 chariot race|epic|false accusation|jerusalem|s... 100000000.0 \n", + "\n", + " title_year \n", + "367 2016.0 \n", + "2613 2016.0 \n", + "3967 2016.0 " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title == 'Ben-Hur']" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "63 David Yates The Legend of Tarzan \n", + "137 David Yates The Legend of Tarzan \n", + "\n", + " plot_keywords budget \\\n", + "63 africa|capture|jungle|male objectification|tarzan 180000000.0 \n", + "137 africa|capture|jungle|male objectification|tarzan 180000000.0 \n", + "\n", + " title_year \n", + "63 2016.0 \n", + "137 2016.0 " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.movie_title == 'The Legend of Tarzan']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## find duplicate values in several columns" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "6 Sam Raimi Spider-Man 3 \n", + "17 Joss Whedon The Avengers \n", + "25 Peter Jackson King Kong \n", + "30 Sam Mendes Skyfall \n", + "33 Tim Burton Alice in Wonderland \n", + "\n", + " plot_keywords budget title_year \n", + "6 sandman|spider man|symbiote|venom|villain 258000000.0 2007.0 \n", + "17 alien invasion|assassin|battle|iron man|soldier 220000000.0 2012.0 \n", + "25 animal name in title|ape abducts a woman|goril... 207000000.0 2005.0 \n", + "30 brawl|childhood home|computer cracker|intellig... 200000000.0 2012.0 \n", + "33 alice in wonderland|mistaking reality for drea... 200000000.0 2010.0 " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.duplicated(subset=['movie_title', 'title_year'], keep=False)].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " director_name movie_title \\\n", + "6 Sam Raimi Spider-Man 3 \n", + "17 Joss Whedon The Avengers \n", + "25 Peter Jackson King Kong \n", + "30 Sam Mendes Skyfall \n", + "33 Tim Burton Alice in Wonderland \n", + "\n", + " plot_keywords budget title_year \n", + "6 sandman|spider man|symbiote|venom|villain 258000000.0 2007.0 \n", + "17 alien invasion|assassin|battle|iron man|soldier 220000000.0 2012.0 \n", + "25 animal name in title|ape abducts a woman|goril... 207000000.0 2005.0 \n", + "30 brawl|childhood home|computer cracker|intellig... 200000000.0 2012.0 \n", + "33 alice in wonderland|mistaking reality for drea... 200000000.0 2010.0 " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies[movies.duplicated(subset=['movie_title', 'director_name', 'budget'], keep=False)].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Drop duplicates" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(5043, 5)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "movies.drop_duplicates(keep='first', inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4920, 5)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "movies.drop_duplicates(subset=['movie_title', 'director_name'], keep=False, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/plain": [ + "(4918, 5)" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## find duplicate data in an index" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({\"X\":[\"A\", \"XX\", \"B\", \"C\"], \"Y\":[11,\"XX\",11,12], \"Z\":[\"X\",\"XX\",\"Y\",\"X\"], 0:[0,1,1,2]})\n", + "df.set_index(0, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " X Y Z\n", + "0 \n", + "0 A 11 X\n", + "1 XX XX XX\n", + "1 B 11 Y\n", + "2 C 12 X" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " X Y Z\n", + "0 \n", + "1 XX XX XX\n", + "1 B 11 Y" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.index.duplicated(keep=False)]" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "df = df[~df.index.duplicated(keep='last')]" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " X Y Z\n", + "0 \n", + "0 A 11 X\n", + "1 B 11 Y\n", + "2 C 12 X" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## find duplicate data in a row" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame({\"X\":[\"A\", \"XX\", \"B\", \"C\"], \"Y\":[11,\"XX\",11,12], \"Z\":[\"X\",\"XX\",\"Y\",\"X\"]})" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " X Y Z\n", + "0 A 11 X\n", + "1 XX XX XX\n", + "2 B 11 Y\n", + "3 C 12 X" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=4, step=1)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "indexes = df.index\n", + "indexes" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [], + "source": [ + "df = df.T" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3\n", + "X A XX B C\n", + "Y 11 XX 11 12\n", + "Z X XX Y X" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(0, 4)" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df.duplicated(keep='first')].shape" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "X True\n", + "Y True\n", + "Z False\n", + "Name: 1, dtype: bool" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[1].duplicated(keep='last')" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "X False\n", + "Y True\n", + "Z True\n", + "Name: 1, dtype: bool" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[1].duplicated(keep='first')" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "X True\n", + "Y True\n", + "Z True\n", + "Name: 1, dtype: bool" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[1].duplicated(keep=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[1].duplicated(keep=False).sum()" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 36, + "metadata": {}, + 
"output_type": "execute_result" + } + ], + "source": [ + "df.shape[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0\n", + "3\n", + "0\n", + "0\n" + ] + } + ], + "source": [ + "for i in indexes:\n", + " print(df[i].duplicated(keep=False).sum())\n", + " if df[i].duplicated(keep=False).sum() == df.shape[0]:\n", + " df.drop(i, inplace=True, axis=1)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " 0 2 3\n", + "X A B C\n", + "Y 11 11 12\n", + "Z X Y X" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " X Y Z\n", + "0 A 11 X\n", + "2 B 11 Y\n", + "3 C 12 X" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.T" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/map_the_headers_to_a_column_with_pandas.ipynb b/notebooks/pandas/map_the_headers_to_a_column_with_pandas.ipynb new file mode 100644 index 0000000..5086ed9 --- /dev/null +++ b/notebooks/pandas/map_the_headers_to_a_column_with_pandas.ipynb @@ -0,0 +1,1071 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Map the headers to a column with pandas?\n", + "\n", + "Data set: Stack Overflow 2018 insights\n", + "\n", + "* https://insights.stackoverflow.com/survey\n", + "* https://insights.stackoverflow.com/survey/2018#technology\n", + "\n", + "Topics\n", + "\n", + "* map headers based on a value to a new column\n", + "\n", + "Bonus\n", + "\n", + "* pandas dot method - matrix multiplication\n", + "* understand np.where\n", + "* map single column of dataframe\n", + "* map all columns of a dataframe\n", + "* map and NaN\n", + "* check all distinct values in dataframe\n", + "* Optimize big data frames:\n", + " * Columns have mixed types. 
Specify dtype option on import or set low_memory=False.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', -1)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(98855, 129)\n" + ] + } + ], + "source": [ + "# read the data frame and see the data insight\n", + "df = pd.read_csv(\"../csv/stackoverflow/developer_survey_2018/survey_results_public.csv\", low_memory=False)\n", + "print(df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "


\n", + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student \\\n", + "0 1 Yes No Kenya No \n", + "1 3 Yes Yes United Kingdom No \n", + "2 4 Yes Yes United States No \n", + "3 5 No No United States No \n", + "4 7 Yes No South Africa Yes, part-time \n", + "\n", + " Employment FormalEducation \\\n", + "0 Employed part-time Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "2 Employed full-time Associate degree \n", + "3 Employed full-time Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "4 Employed full-time Some college/university study without earning a degree \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "2 Computer science, computer engineering, or software engineering \n", + "3 Computer science, computer engineering, or software engineering \n", + "4 Computer science, computer engineering, or software engineering \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "2 20 to 99 employees \n", + "3 100 to 499 employees \n", + "4 10,000 or more employees \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "2 Engineering manager;Full-stack developer \n", + "3 Full-stack developer \n", + "4 Data or business analyst;Desktop or enterprise applications developer;Game or graphics developer;QA or test developer;Student \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "1 ... Daily or almost every day Male Straight or heterosexual \n", + "2 ... NaN NaN NaN \n", + "3 ... I don't typically exercise Male Straight or heterosexual \n", + "4 ... 3 - 4 times per week Male Straight or heterosexual \n", + "\n", + " EducationParents \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) 
\n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "2 NaN \n", + "3 Some college/university study without earning a degree \n", + "4 Some college/university study without earning a degree \n", + "\n", + " RaceEthnicity Age Dependents MilitaryUS \\\n", + "0 Black or of African descent 25 - 34 years old Yes NaN \n", + "1 White or of European descent 35 - 44 years old Yes NaN \n", + "2 NaN NaN NaN NaN \n", + "3 White or of European descent 35 - 44 years old No No \n", + "4 White or of European descent 18 - 24 years old Yes NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy \n", + "2 NaN NaN \n", + "3 The survey was an appropriate length Somewhat easy \n", + "4 The survey was an appropriate length Somewhat easy \n", + "\n", + "[5 rows x 129 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# examples\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " Hobby OpenSource Student\n", + "0 Yes No No \n", + "1 Yes Yes No \n", + "2 Yes Yes No \n", + "3 No No No \n", + "4 Yes No Yes, part-time" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create new data frame with 3 columns\n", + "columns = ['Hobby', 'OpenSource', 'Student']\n", + "df_answers = df[columns]\n", + "df_answers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.0\n", + "1 0.0\n", + "2 0.0\n", + "3 0.0\n", + "4 NaN \n", + "Name: Student, dtype: float64" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# map single column of dataframe\n", + "df_answers.Student.map( {'Yes':1, 'No':0}).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['No', 'Yes, part-time', nan, 'Yes, full-time'], dtype=object)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check all distinct values in dataframe\n", + "df_answers.Student.unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " Hobby OpenSource Student\n", + "0 1 0 0 \n", + "1 1 1 0 \n", + "2 1 1 0 \n", + "3 0 0 0 \n", + "4 1 0 0 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# map all columns of a dataframe\n", + "import numpy as np\n", + "new_values = {'Yes':1, 'No':0, 'Yes, part-time':0, 'Yes, full-time':0, np.nan:0}\n", + "\n", + "df_answers = df_answers.apply(lambda x: x.map( new_values ))\n", + "\n", + "df_answers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
HobbyOpenSourceStudentanswer
0100Hobby
1110HobbyOpenSource
2110HobbyOpenSource
3000
4100Hobby
\n", + "
" + ], + "text/plain": [ + " Hobby OpenSource Student answer\n", + "0 1 0 0 Hobby \n", + "1 1 1 0 HobbyOpenSource\n", + "2 1 1 0 HobbyOpenSource\n", + "3 0 0 0 \n", + "4 1 0 0 Hobby " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# map headers to columns way 1\n", + "df_answers['answer'] = np.where(df_answers, df_answers.columns, '').sum(axis=1)\n", + "df_answers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(array([4, 5, 6]),)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.array([1, 2,3, 4, 9, 7, 8, 6])\n", + "np.where(a > 6)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([9, 7, 8])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.array([1, 2,3, 4, 9, 7, 8, 6])\n", + "a[np.where(a > 6)]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1, 8],\n", + " [3, 4]])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.where([[True, False], [True, True]],\n", + " [[1, 2], [3, 4]],\n", + " [[9, 8], [7, 6]])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[9, 8],\n", + " [7, 6]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.where([[False, False], [False, False]],\n", + " [[1, 2], [3, 4]],\n", + " [[9, 8], [7, 6]])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[1, 2],\n", + " [3, 4]])" + ] + 
}, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.where([[True, True], [True, True]],\n", + " [[1, 2], [3, 4]],\n", + " [[9, 8], [7, 6]])" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
HobbyOpenSourceStudent
0100
1110
2110
3000
4100
\n", + "
" + ], + "text/plain": [ + " Hobby OpenSource Student\n", + "0 1 0 0 \n", + "1 1 1 0 \n", + "2 1 1 0 \n", + "3 0 0 0 \n", + "4 1 0 0 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_answers.drop('answer', axis=1, inplace=True)\n", + "df_answers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
HobbyOpenSourceStudentanswer
0100Hobby
1110HobbyOpenSource
2110HobbyOpenSource
3000
4100Hobby
\n", + "
" + ], + "text/plain": [ + " Hobby OpenSource Student answer\n", + "0 1 0 0 Hobby \n", + "1 1 1 0 HobbyOpenSource\n", + "2 1 1 0 HobbyOpenSource\n", + "3 0 0 0 \n", + "4 1 0 0 Hobby " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# map headers to columns way 2\n", + "df_answers.assign(answer=df_answers.dot(df_answers.columns)).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
012
145
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 1 2\n", + "1 4 5" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = pd.DataFrame([[1, 2], \n", + " [4, 5]])\n", + "b = pd.DataFrame([[1, 0], \n", + " [0, 1]])\n", + "a.dot(b)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
01
020
180
\n", + "
" + ], + "text/plain": [ + " 0 1\n", + "0 2 0\n", + "1 8 0" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = pd.DataFrame([[1, 2], \n", + " [4, 5]])\n", + "b = pd.DataFrame([[2, 0], \n", + " [0, 0]])\n", + "a.dot(b)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/pandas/pandas-use-list-values-select-rows-column.ipynb b/notebooks/pandas/pandas-use-list-values-select-rows-column.ipynb new file mode 100644 index 0000000..41e2e0e --- /dev/null +++ b/notebooks/pandas/pandas-use-list-values-select-rows-column.ipynb @@ -0,0 +1,1453 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Pandas use a list of values to select rows from a column\n", + "\n", + "* filter pandas rows by exact match from a list\n", + "* filter pandas rows by partial match from a list\n", + "\n", + "Bonus\n", + "\n", + "* execute value counts on multiple columns\n", + "* vectorized operations\n", + "\n", + "> Vectorization is the process of executing operations on entire arrays. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', -1)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(98855, 129)\n" + ] + } + ], + "source": [ + "# read the data frame and see the data insight\n", + "df = pd.read_csv(\"../csv/stackoverflow/developer_survey_2018/survey_results_public.csv\", low_memory=False)\n", + "print(df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
RespondentHobbyOpenSourceCountryStudentEmploymentFormalEducationUndergradMajorCompanySizeDevType...ExerciseGenderSexualOrientationEducationParentsRaceEthnicityAgeDependentsMilitaryUSSurveyTooLongSurveyEasy
01YesNoKenyaNoEmployed part-timeBachelor’s degree (BA, BS, B.Eng., etc.)Mathematics or statistics20 to 99 employeesFull-stack developer...3 - 4 times per weekMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)Black or of African descent25 - 34 years oldYesNaNThe survey was an appropriate lengthVery easy
13YesYesUnited KingdomNoEmployed full-timeBachelor’s degree (BA, BS, B.Eng., etc.)A natural science (ex. biology, chemistry, physics)10,000 or more employeesDatabase administrator;DevOps specialist;Full-stack developer;System administrator...Daily or almost every dayMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)White or of European descent35 - 44 years oldYesNaNThe survey was an appropriate lengthSomewhat easy
\n", + "

2 rows × 129 columns

\n", + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student Employment \\\n", + "0 1 Yes No Kenya No Employed part-time \n", + "1 3 Yes Yes United Kingdom No Employed full-time \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "\n", + " UndergradMajor \\\n", + "0 Mathematics or statistics \n", + "1 A natural science (ex. biology, chemistry, physics) \n", + "\n", + " CompanySize \\\n", + "0 20 to 99 employees \n", + "1 10,000 or more employees \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "1 Database administrator;DevOps specialist;Full-stack developer;System administrator \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "1 ... Daily or almost every day Male Straight or heterosexual \n", + "\n", + " EducationParents RaceEthnicity \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent \n", + "1 Bachelor’s degree (BA, BS, B.Eng., etc.) White or of European descent \n", + "\n", + " Age Dependents MilitaryUS \\\n", + "0 25 - 34 years old Yes NaN \n", + "1 35 - 44 years old Yes NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "1 The survey was an appropriate length Somewhat easy \n", + "\n", + "[2 rows x 129 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Computer science, computer engineering, or software engineering 50336\n", + "Another engineering discipline (ex. civil, electrical, mechanical) 6945 \n", + "Information systems, information technology, or system administration 6507 \n", + "A natural science (ex. 
biology, chemistry, physics) 3050 \n", + "Mathematics or statistics 2818 \n", + "Web development or web design 2418 \n", + "A business discipline (ex. accounting, finance, marketing) 1921 \n", + "A humanities discipline (ex. literature, history, philosophy) 1590 \n", + "A social science (ex. anthropology, psychology, political science) 1377 \n", + "Fine arts or performing arts (ex. graphic design, music, studio art) 1135 \n", + "I never declared a major 693 \n", + "A health science (ex. nursing, pharmacy, radiology) 246 \n", + "Name: UndergradMajor, dtype: int64" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.UndergradMajor.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
RespondentHobbyOpenSourceCountryStudentEmploymentFormalEducationUndergradMajorCompanySizeDevType...ExerciseGenderSexualOrientationEducationParentsRaceEthnicityAgeDependentsMilitaryUSSurveyTooLongSurveyEasy
01YesNoKenyaNoEmployed part-timeBachelor’s degree (BA, BS, B.Eng., etc.)Mathematics or statistics20 to 99 employeesFull-stack developer...3 - 4 times per weekMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)Black or of African descent25 - 34 years oldYesNaNThe survey was an appropriate lengthVery easy
3251YesNoUnited StatesNoEmployed full-timeBachelor’s degree (BA, BS, B.Eng., etc.)Web development or web design500 to 999 employeesBack-end developer;Designer;Front-end developer;Full-stack developer;Marketing or sales professional;Mobile developer...Daily or almost every dayFemaleStraight or heterosexualAssociate degreeWhite or of European descent18 - 24 years oldNoNoThe survey was an appropriate lengthVery easy
82124YesYesUnited KingdomNoEmployed full-timeMaster’s degree (MA, MS, M.Eng., MBA, etc.)Mathematics or statistics10,000 or more employeesBack-end developer;DevOps specialist;Front-end developer;Full-stack developer;Mobile developer...1 - 2 times per weekMaleStraight or heterosexualBachelor’s degree (BA, BS, B.Eng., etc.)White or of European descent25 - 34 years oldYesNaNThe survey was an appropriate lengthVery easy
84126YesYesArgentinaYes, part-timeEmployed full-timeSome college/university study without earning a degreeWeb development or web designFewer than 10 employeesMobile developer...1 - 2 times per weekMaleStraight or heterosexualSome college/university study without earning a degreeNaN25 - 34 years oldNoNaNThe survey was an appropriate lengthVery easy
148230YesYesUnited StatesNoEmployed full-timeBachelor’s degree (BA, BS, B.Eng., etc.)Mathematics or statistics1,000 to 4,999 employeesData scientist or machine learning specialist...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", + "

5 rows × 129 columns

\n", + "
" + ], + "text/plain": [ + " Respondent Hobby OpenSource Country Student \\\n", + "0 1 Yes No Kenya No \n", + "32 51 Yes No United States No \n", + "82 124 Yes Yes United Kingdom No \n", + "84 126 Yes Yes Argentina Yes, part-time \n", + "148 230 Yes Yes United States No \n", + "\n", + " Employment \\\n", + "0 Employed part-time \n", + "32 Employed full-time \n", + "82 Employed full-time \n", + "84 Employed full-time \n", + "148 Employed full-time \n", + "\n", + " FormalEducation \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "32 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "82 Master’s degree (MA, MS, M.Eng., MBA, etc.) \n", + "84 Some college/university study without earning a degree \n", + "148 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "\n", + " UndergradMajor CompanySize \\\n", + "0 Mathematics or statistics 20 to 99 employees \n", + "32 Web development or web design 500 to 999 employees \n", + "82 Mathematics or statistics 10,000 or more employees \n", + "84 Web development or web design Fewer than 10 employees \n", + "148 Mathematics or statistics 1,000 to 4,999 employees \n", + "\n", + " DevType \\\n", + "0 Full-stack developer \n", + "32 Back-end developer;Designer;Front-end developer;Full-stack developer;Marketing or sales professional;Mobile developer \n", + "82 Back-end developer;DevOps specialist;Front-end developer;Full-stack developer;Mobile developer \n", + "84 Mobile developer \n", + "148 Data scientist or machine learning specialist \n", + "\n", + " ... Exercise Gender SexualOrientation \\\n", + "0 ... 3 - 4 times per week Male Straight or heterosexual \n", + "32 ... Daily or almost every day Female Straight or heterosexual \n", + "82 ... 1 - 2 times per week Male Straight or heterosexual \n", + "84 ... 1 - 2 times per week Male Straight or heterosexual \n", + "148 ... NaN NaN NaN \n", + "\n", + " EducationParents \\\n", + "0 Bachelor’s degree (BA, BS, B.Eng., etc.) 
\n", + "32 Associate degree \n", + "82 Bachelor’s degree (BA, BS, B.Eng., etc.) \n", + "84 Some college/university study without earning a degree \n", + "148 NaN \n", + "\n", + " RaceEthnicity Age Dependents MilitaryUS \\\n", + "0 Black or of African descent 25 - 34 years old Yes NaN \n", + "32 White or of European descent 18 - 24 years old No No \n", + "82 White or of European descent 25 - 34 years old Yes NaN \n", + "84 NaN 25 - 34 years old No NaN \n", + "148 NaN NaN NaN NaN \n", + "\n", + " SurveyTooLong SurveyEasy \n", + "0 The survey was an appropriate length Very easy \n", + "32 The survey was an appropriate length Very easy \n", + "82 The survey was an appropriate length Very easy \n", + "84 The survey was an appropriate length Very easy \n", + "148 NaN NaN \n", + "\n", + "[5 rows x 129 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[df['UndergradMajor'].isin(['Mathematics or statistics', \n", + " 'Web development or web design'])].head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "area_list = ['biology', 'physics', 'Computer', 'enginnering', 'pharmacy', 'psychology', 'graphic design',\n", + " 'music', 'art', 'studio art', 'accounting', 'finance', 'chemistry',]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
biologyphysicsComputerenginneringpharmacypsychologygraphic designmusicartstudio artaccountingfinancechemistry
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1TrueTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
2FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
5FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
6FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
7FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
8FalseFalseFalseFalseFalseFalseTrueTrueTrueTrueFalseFalseFalse
9FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
10FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
11FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
12FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
13FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
15FalseFalseFalseFalseFalseFalseTrueTrueTrueTrueFalseFalseFalse
16FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
17FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueTrueFalse
18NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
19FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueTrueFalse
20FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
21NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
22FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
23NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
24FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
25TrueTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
26FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
27FalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalse
28FalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
29FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", + "
" + ], + "text/plain": [ + " biology physics Computer enginnering pharmacy psychology graphic design \\\n", + "0 False False False False False False False \n", + "1 True True False False False False False \n", + "2 False False True False False False False \n", + "3 False False True False False False False \n", + "4 False False True False False False False \n", + "5 False False True False False False False \n", + "6 False False True False False False False \n", + "7 False False True False False False False \n", + "8 False False False False False False True \n", + "9 False False True False False False False \n", + "10 False False False False False False False \n", + "11 False False False False False False False \n", + "12 False False True False False False False \n", + "13 False False False False False False False \n", + "14 NaN NaN NaN NaN NaN NaN NaN \n", + "15 False False False False False False True \n", + "16 False False True False False False False \n", + "17 False False False False False False False \n", + "18 NaN NaN NaN NaN NaN NaN NaN \n", + "19 False False False False False False False \n", + "20 False False False False False False False \n", + "21 NaN NaN NaN NaN NaN NaN NaN \n", + "22 False False True False False False False \n", + "23 NaN NaN NaN NaN NaN NaN NaN \n", + "24 False False True False False False False \n", + "25 True True False False False False False \n", + "26 False False True False False False False \n", + "27 False False False False False True False \n", + "28 False False True False False False False \n", + "29 False False False False False False False \n", + "\n", + " music art studio art accounting finance chemistry \n", + "0 False False False False False False \n", + "1 False False False False False True \n", + "2 False False False False False False \n", + "3 False False False False False False \n", + "4 False False False False False False \n", + "5 False False False False False False \n", + "6 False False False False False False \n", 
+ "7 False False False False False False \n", + "8 True True True False False False \n", + "9 False False False False False False \n", + "10 False False False False False False \n", + "11 False False False False False False \n", + "12 False False False False False False \n", + "13 False False False False False False \n", + "14 NaN NaN NaN NaN NaN NaN \n", + "15 True True True False False False \n", + "16 False False False False False False \n", + "17 False False False True True False \n", + "18 NaN NaN NaN NaN NaN NaN \n", + "19 False False False True True False \n", + "20 False False False False False False \n", + "21 NaN NaN NaN NaN NaN NaN \n", + "22 False False False False False False \n", + "23 NaN NaN NaN NaN NaN NaN \n", + "24 False False False False False False \n", + "25 False False False False False True \n", + "26 False False False False False False \n", + "27 False False False False False False \n", + "28 False False False False False False \n", + "29 False False False False False False " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import re\n", + "area_df = pd.DataFrame(dict((area, df.UndergradMajor.str.contains(area))\n", + " for area in area_list))\n", + "area_df.head(30)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Back-end developer 6417\n", + "Full-stack developer 6104\n", + "Back-end developer;Front-end developer;Full-stack developer 4460\n", + "Mobile developer 3518\n", + "Student 3222\n", + "Name: DevType, dtype: int64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.DevType.value_counts().head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "dev_list = ['Mobile', 'Data', 'QA']" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + 
"outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
012345678910111213141516171819
MobileFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalse
DataFalseTrueFalseFalseTrueTrueFalseFalseTrueFalseTrueFalseFalseFalseFalseFalseFalseFalseTrueFalse
QAFalseFalseFalseFalseTrueFalseFalseTrueFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue
\n", + "
" + ], + "text/plain": [ + " 0 1 2 3 4 5 6 7 8 9 \\\n", + "Mobile False False False False False False False False False False \n", + "Data False True False False True True False False True False \n", + "QA False False False False True False False True False False \n", + "\n", + " 10 11 12 13 14 15 16 17 18 19 \n", + "Mobile True False False False False False False False False False \n", + "Data True False False False False False False False True False \n", + "QA False False False False False False False False False True " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import re\n", + "dev_df = pd.DataFrame(dict((dev, df.DevType.str.contains(dev, re.IGNORECASE))\n", + " for dev in dev_list))\n", + "dev_df.head(20).T" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False 85904\n", + "True 6194 \n", + "Name: QA, dtype: int64" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dev_df.QA.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       Mobile   Data     QA
False   73294  70209  85904
True    18804  21889   6194
\n", + "
" + ], + "text/plain": [ + " Mobile Data QA\n", + "False 73294 70209 85904\n", + "True 18804 21889 6194 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dev_df.apply(pd.Series.value_counts)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       Mobile     QA
False   73294  85904
True    18804   6194
\n", + "
" + ], + "text/plain": [ + " Mobile QA\n", + "False 73294 85904\n", + "True 18804 6194 " + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dev_df[['Mobile','QA']].apply(pd.Series.value_counts)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/python/Files/How_to_merge_multiple_CSV_files_with_Python.ipynb b/notebooks/python/Files/How_to_merge_multiple_CSV_files_with_Python.ipynb new file mode 100644 index 0000000..7651657 --- /dev/null +++ b/notebooks/python/Files/How_to_merge_multiple_CSV_files_with_Python.ipynb @@ -0,0 +1,664 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to merge multiple CSV files with Python\n", + "Python convert normal JSON to JSON separated lines 3 examples\n", + "\n", + "* Steps to merge multiple CSV(identical) files with Python\n", + "* Steps to merge multiple CSV(identical) files with Python with trace\n", + "* Combine multiple CSV files when the columns are different\n", + "* Bonus: Merge multiple files with Windows/Linux" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['../../csv/data_202001.csv',\n", + " '../../csv/data_202002.csv',\n", + " '../../csv/data_201902.csv',\n", + " '../../csv/data_201901.csv']" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3 col4
0    E    F     5   e5
1   EE   FF     6  ee6
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3 col4\n", + "0 E F 5 e5\n", + "1 EE FF 6 ee6" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3  col5
0    H    J     7    77
1   HH   JJ     8    88
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3 col5\n", + "0 H J 7 77\n", + "1 HH JJ 8 88" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3
0    C    D     3
1   CC   DD     4
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "0 C D 3\n", + "1 CC DD 4" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3
0    A    B     1
1   AA   BB     2
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3\n", + "0 A B 1\n", + "1 AA BB 2" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "all_files = glob.glob(os.path.join(path, \"data_*.csv\"))\n", + "display(all_files)\n", + "for f in all_files:\n", + " display(pd.read_csv(f, sep=','))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Steps to merge multiple CSV(identical) files with Python" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import os, glob\n", + "import pandas as pd\n", + "\n", + "path = \"../../csv/\"\n", + "#path = \"/home/user/data\"\n", + "\n", + "all_files = glob.glob(os.path.join(path, \"data_2019*.csv\"))\n", + "\n", + "all_csv = (pd.read_csv(f, sep=',') for f in all_files)\n", + "df_merged = pd.concat(all_csv, ignore_index=True)\n", + "df_merged.to_csv( \"merged.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   Unnamed: 0 col1 col2  col3
0           0    C    D     3
1           1   CC   DD     4
2           2    A    B     1
3           3   AA   BB     2
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 col1 col2 col3\n", + "0 0 C D 3\n", + "1 1 CC DD 4\n", + "2 2 A B 1\n", + "3 3 AA BB 2" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.read_csv('merged.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Steps to merge multiple CSV(identical) files with Python with trace" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3             file
0    C    D     3  data_201902.csv
1   CC   DD     4  data_201902.csv
2    A    B     1  data_201901.csv
3   AA   BB     2  data_201901.csv
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3 file\n", + "0 C D 3 data_201902.csv\n", + "1 CC DD 4 data_201902.csv\n", + "2 A B 1 data_201901.csv\n", + "3 AA BB 2 data_201901.csv" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os, glob\n", + "import pandas as pd\n", + "\n", + "path = \"../../csv/\"\n", + "\n", + "all_files = glob.glob(os.path.join(path, \"data_2019*.csv\"))\n", + "\n", + "all_df = []\n", + "for f in all_files:\n", + " df = pd.read_csv(f, sep=',')\n", + " df['file'] = f.split('/')[-1]\n", + " all_df.append(df)\n", + " \n", + "merged_df = pd.concat(all_df, ignore_index=True)\n", + "merged_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Combine multiple CSV files when the columns are different" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  col1 col2  col3 col4  col5             file
0    E    F     5   e5   NaN  data_202001.csv
1   EE   FF     6  ee6   NaN  data_202001.csv
2    H    J     7  NaN  77.0  data_202002.csv
3   HH   JJ     8  NaN  88.0  data_202002.csv
4    C    D     3  NaN   NaN  data_201902.csv
5   CC   DD     4  NaN   NaN  data_201902.csv
6    A    B     1  NaN   NaN  data_201901.csv
7   AA   BB     2  NaN   NaN  data_201901.csv
\n", + "
" + ], + "text/plain": [ + " col1 col2 col3 col4 col5 file\n", + "0 E F 5 e5 NaN data_202001.csv\n", + "1 EE FF 6 ee6 NaN data_202001.csv\n", + "2 H J 7 NaN 77.0 data_202002.csv\n", + "3 HH JJ 8 NaN 88.0 data_202002.csv\n", + "4 C D 3 NaN NaN data_201902.csv\n", + "5 CC DD 4 NaN NaN data_201902.csv\n", + "6 A B 1 NaN NaN data_201901.csv\n", + "7 AA BB 2 NaN NaN data_201901.csv" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os, glob\n", + "import pandas as pd\n", + "\n", + "path = \"../../csv/\"\n", + "\n", + "all_files = glob.glob(os.path.join(path, \"data_*.csv\"))\n", + "\n", + "\n", + "all_df = []\n", + "for f in all_files:\n", + " df = pd.read_csv(f, sep=',')\n", + " df['file'] = f.split('/')[-1]\n", + " all_df.append(df)\n", + " \n", + "merged_df = pd.concat(all_df, ignore_index=True, sort=True)\n", + "merged_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Bonus: Merge multiple files with Windows/Linux\n", + "\n", + "Linux\n", + "\n", + "`sed 1d data_*.csv > merged.csv`\n", + "\n", + "Windows\n", + "\n", + "`C:\\> copy data_*.csv merged.csv `" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/python/JSON/41._Create_a_table_in_MySQL_Database_from_python_dictionary.ipynb b/notebooks/python/JSON/41._Create_a_table_in_MySQL_Database_from_python_dictionary.ipynb new file mode 100644 index 0000000..3bee813 --- /dev/null +++ 
b/notebooks/python/JSON/41._Create_a_table_in_MySQL_Database_from_python_dictionary.ipynb @@ -0,0 +1,421 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 41. Create a table in SQL(MySQL Database) from python dictionary\n", + "\n", + "\n", + "[Python convert normal JSON to JSON separated lines 3 examples](https://blog.softhints.com/python-convert-json-to-json-lines/)\n", + "\n", + "* Pandas DataFrame to MySQL\n", + "* Create table from Python Dict\n", + "* connect MySQL database and Python\n", + " * SQLAlchemy\n", + " * PyMySQL" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python dict which is converted to a Database Table\n", + "\n", + "```json\n", + "{\"id\":1,\"label\":\"A\",\"size\":\"S\"}\n", + "{\"id\":2,\"label\":\"B\",\"size\":\"XL\"}\n", + "{\"id\":3,\"label\":\"C\",\"size\":\"XXl\"}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 1: Read/Create a Python dict" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   id label size
0   1     A    S
1   2     B   XL
2   3     C  XXl
\n", + "
" + ], + "text/plain": [ + " id label size\n", + "0 1 A S\n", + "1 2 B XL\n", + "2 3 C XXl" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "# read normal JSON with pandas\n", + "df = pd.read_json('/home/vanx/Downloads/old/normal_json.json')\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': {0: 1, 1: 2, 2: 3},\n", + " 'label': {0: 'A', 1: 'B', 2: 'C'},\n", + " 'size': {0: 'S', 1: 'XL', 2: 'XXl'}}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_dict = df.to_dict()\n", + "data_dict" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   id label size
0   1     A    S
1   2     B   XL
2   3     C  XXl
\n", + "
" + ], + "text/plain": [ + " id label size\n", + "0 1 A S\n", + "1 2 B XL\n", + "2 3 C XXl" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2 = pd.DataFrame.from_dict(data_dict)\n", + "df2.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Pandas DataFrame to MySQL table with SQLAlchemy" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# connect\n", + "from sqlalchemy import create_engine\n", + "cnx = create_engine('mysql+pymysql://test:pass@localhost/test') " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# create table from DataFrame\n", + "df.to_sql('test', cnx, if_exists='replace', index = False)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   id label size
0   1     A    S
1   2     B   XL
2   3     C  XXl
\n", + "
" + ], + "text/plain": [ + " id label size\n", + "0 1 A S\n", + "1 2 B XL\n", + "2 3 C XXl" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# query table\n", + "df = pd.read_sql('SELECT * FROM test', cnx)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Python Dict Insert Records Into a MySQL Database" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# connect\n", + "import pymysql\n", + "\n", + "connection = pymysql.connect(host='localhost',\n", + " user='test',\n", + " password='pass',\n", + " db='test')\n", + "cursor = connection.cursor()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Create table\n", + "cols = df.columns\n", + "table_name = 'test'\n", + "ddl = \"\"\n", + "for col in cols:\n", + " ddl += \"`{}` text,\".format(col)\n", + "\n", + "sql_create = \"CREATE TABLE IF NOT EXISTS `{}` ({}) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;\".format(table_name, ddl[:-1])\n", + "cursor.execute(sql_create)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "# insert data\n", + "cols = \"`,`\".join([str(i) for i in df.columns.tolist()])\n", + "\n", + "# insert dict records .\n", + "for i,row in df.iterrows():\n", + " sql = \"INSERT INTO `test` (`\" +cols + \"`) VALUES (\" + \"%s,\"*(len(row)-1) + \"%s)\"\n", + " cursor.execute(sql, tuple(row))\n", + " connection.commit()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('1', 'A', 'S')\n", + "('2', 'B', 'XL')\n", + "('3', 'C', 'XXl')\n" + ] + } + ], + "source": 
[ + "# read\n", + "sql = \"SELECT * FROM test\"\n", + "cursor.execute(sql)\n", + "result = cursor.fetchall()\n", + "for i in result:\n", + " print(i)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/python/JSON/42._Convert_MySQL_table_to_Pandas_DataFrame_Python_dictionary.ipynb b/notebooks/python/JSON/42._Convert_MySQL_table_to_Pandas_DataFrame_Python_dictionary.ipynb new file mode 100644 index 0000000..b762e57 --- /dev/null +++ b/notebooks/python/JSON/42._Convert_MySQL_table_to_Pandas_DataFrame_Python_dictionary.ipynb @@ -0,0 +1,222 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 42. Convert MySQL table to Pandas DataFrame(Python dictionary)\n", + "\n", + "\n", + "[How to Convert MySQL Table to Pandas DataFrame / Python Dictionary](https://blog.softhints.com/convert-mysql-table-pandas-dataframe-python-dictionary/)\n", + "\n", + "* [PyMySQL](https://pypi.org/project/PyMySQL/) + [SQLAlchemy](https://pypi.org/project/SQLAlchemy/) - the shortest and easiest way to convert MySQL table to Python dict\n", + "* [mysql.connector](https://pypi.org/project/mysql-connector-python/)\n", + "* [pyodbc](https://pypi.org/project/pyodbc/) in order to connect to MySQL database, read table and convert it to DataFrame or Python dict." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![](https://blog.softhints.com/content/images/2020/11/MySQL_table_to_Pandas_DataFrame_to_Python_dict.png)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "password = ''" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1: Convert MySQL Table to DataFrame with PyMySQL + SQLAlchemy " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},\n", + " 'name': {0: 'Emma', 1: 'Ann', 2: 'Kim', 3: 'Olivia', 4: 'Victoria'}}" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sqlalchemy import create_engine\n", + "import pymysql\n", + "import pandas as pd\n", + "\n", + "db_connection_str = 'mysql+pymysql://root:' + password + '@localhost:3306/test'\n", + "db_connection = create_engine(db_connection_str)\n", + "\n", + "df = pd.read_sql('SELECT * FROM girls', con=db_connection)\n", + "df.to_dict()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'id': 1, 'name': 'Emma'},\n", + " {'id': 2, 'name': 'Ann'},\n", + " {'id': 3, 'name': 'Kim'},\n", + " {'id': 4, 'name': 'Olivia'},\n", + " {'id': 5, 'name': 'Victoria'}]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.to_dict('records')" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': [1, 2, 3, 4, 5], 'name': ['Emma', 'Ann', 'Kim', 'Olivia', 'Victoria']}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.to_dict('list')" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + 
"outputs": [ + { + "data": { + "text/plain": [ + "{0: {'id': 1, 'name': 'Emma'},\n", + " 1: {'id': 2, 'name': 'Ann'},\n", + " 2: {'id': 3, 'name': 'Kim'},\n", + " 3: {'id': 4, 'name': 'Olivia'},\n", + " 4: {'id': 5, 'name': 'Victoria'}}" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.to_dict('index')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2: Convert MySQL Table to DataFrame with mysql.connector" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},\n", + " 1: {0: bytearray(b'Emma'),\n", + " 1: bytearray(b'Ann'),\n", + " 2: bytearray(b'Kim'),\n", + " 3: bytearray(b'Olivia'),\n", + " 4: bytearray(b'Victoria')}}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import mysql.connector\n", + "\n", + "# Setup MySQL connection\n", + "db = mysql.connector.connect(\n", + " host=\"localhost\", # your host, usually localhost\n", + " user=\"root\", # your username\n", + " password=password, # your password\n", + " database=\"test\" # name of the data base\n", + ") \n", + "\n", + "# You must create a Cursor object. 
It will let you execute all the queries you need\n", + "cur = db.cursor()\n", + "\n", + "# Use all the SQL you like\n", + "cur.execute(\"SELECT * FROM girls\")\n", + "\n", + "# Put it all to a data frame\n", + "df_sql_data = pd.DataFrame(cur.fetchall())\n", + "\n", + "# Close the session\n", + "db.close()\n", + "\n", + "# Show the data\n", + "df_sql_data.to_dict()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/python_problems/Python_problems_for_beginners_1.ipynb b/notebooks/python_problems/Python_problems_for_beginners_1.ipynb new file mode 100644 index 0000000..697835e --- /dev/null +++ b/notebooks/python_problems/Python_problems_for_beginners_1.ipynb @@ -0,0 +1,297 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Python problems for beginners" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Problem 1 Triangle\n", + "\n", + "Write a simple program that demonstrate star pattern in Python 3.x for any n:\n", + "\n", + "Example n=5\n", + "\n", + " * \n", + " * * \n", + " * * * \n", + " * * * * \n", + " * * * * * " + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "* \n", + "* * \n", + "* * * \n", + "* * * * \n", + "* * * * * \n" + ] + } + ], + "source": [ + "n = 5\n", + "\n", + "for i in range(0, n+1):\n", + " print('* ' * i)" + ] + }, + { + "cell_type": "code", + 
"execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7\n" + ] + } + ], + "source": [ + "n = n +2\n", + "print(n)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ddd" + ] + } + ], + "source": [ + "for x in ['a', 's', 'd']:\n", + " print('d', end='')" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[3, 6, 9]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list(range(3,10,3))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Problem 2 Triangle with numbers\n", + "\n", + "Write a simple program that demonstrate triangle (with numbers 0..n per line) in Python 3.x for any n:\n", + "\n", + "Example n=4\n", + "\n", + " 1 \n", + " 1 2 \n", + " 1 2 3 \n", + " 1 2 3 4" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "x\n", + "\n", + "x\n", + "y\n", + "1 \n", + "x\n", + "y\n", + "1 y\n", + "2 \n", + "x\n", + "y\n", + "1 y\n", + "2 y\n", + "3 \n", + "x\n", + "y\n", + "1 y\n", + "2 y\n", + "3 y\n", + "4 \n" + ] + } + ], + "source": [ + "n = 4\n", + "\n", + "for i in range(0, n+1):\n", + " for j in range(1, i + 1):\n", + " print(j, end=' ')\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Homework 1 Triangle with letters \n", + "\n", + "Write a simple program that demonstrate triangle (with consequtive letters) in Python 3.x for any n:\n", + "\n", + "Example n=4\n", + "\n", + " A \n", + " B C \n", + " D E F \n", + " G H I J \n", + " K L M N O " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Homework 2 Diagonal of numbers\n", + "\n", + "Write a 
simple program that demonstrate diagonal pattern in Python 3.x for any n:\n", + "\n", + "Example n=4\n", + "\n", + " 0\n", + " 1\n", + " 2\n", + " 3\n", + " 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Homework 3 Pyramid\n", + "\n", + "Write a simple program that demonstrate pyramid pattern in Python 3.x for any n:\n", + "\n", + "Example n=3\n", + "\n", + " * \n", + " * * * \n", + " * * * * * " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "0 - 1\n", + "1 - 3\n", + "2 - 5\n", + "\n", + "2 * i + 1" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " * \n", + " * * * \n", + "* * * * * \n" + ] + } + ], + "source": [ + "n = 3\n", + "\n", + "for i in range(n):\n", + " row = '* ' * (2 * i + 1) # calc the * for a given row based formula\n", + " print(row.center(n * 3))" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " *\n", + " ***\n", + " *****\n", + " *******\n", + "*********\n" + ] + } + ], + "source": [ + "n = 5\n", + "\n", + "for i in range(n):\n", + " print( ' ' * (n-i-1), end='')\n", + " print('*' * (2 * i + 1))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/youtube/Youtube-PewDiePie.ipynb b/notebooks/youtube/Youtube-PewDiePie.ipynb new file mode 100644 index 0000000..8e457ed --- /dev/null +++ b/notebooks/youtube/Youtube-PewDiePie.ipynb @@ -0,0 
+1,1790 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "pd.set_option('display.max_colwidth', -1)\n", + "pd.options.display.float_format = '{:,}'.format" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(143, 8)\n" + ] + } + ], + "source": [ + "df = pd.read_csv(\n", + " \"~/Projects/MYP/Datasets/Youtube/me20190528.csv\", sep=\"@\")\n", + "print(df.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   title  Views  Like  Dislike  Favorite  Comment  videoID  tags
0  PyCharm/IntelliJ fast and auto change of the color theme  41.0  0.0  0.0  0.0  2.0  https://www.youtube.com/embed/SsX9Fl958W0  https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg
1  How to add weather desklet to Linux Mint 19  291.0  0.0  0.0  0.0  0.0  https://www.youtube.com/embed/-FPY_e0BdJs  https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg
2  How to easy integrate Google Calendar to Desktop for Linux Mint  226.0  1.0  0.0  0.0  0.0  https://www.youtube.com/embed/2evIujisdD0  https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg
3  Pandas use a list of values to select rows from a column  45.0  3.0  0.0  0.0  10.0  https://www.youtube.com/embed/jlSbo5wmTPQ  https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg
4  Pandas count and percentage by value for a column  63.0  3.0  0.0  0.0  0.0  https://www.youtube.com/embed/P5pxJkv71BU  https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg
\n", + "
" + ], + "text/plain": [ + " title Views \\\n", + "0 PyCharm/IntelliJ fast and auto change of the color theme 41.0 \n", + "1 How to add weather desklet to Linux Mint 19 291.0 \n", + "2 How to easy integrate Google Calendar to Desktop for Linux Mint 226.0 \n", + "3 Pandas use a list of values to select rows from a column 45.0 \n", + "4 Pandas count and percentage by value for a column 63.0 \n", + "\n", + " Like Dislike Favorite Comment \\\n", + "0 0.0 0.0 0.0 2.0 \n", + "1 0.0 0.0 0.0 0.0 \n", + "2 1.0 0.0 0.0 0.0 \n", + "3 3.0 0.0 0.0 10.0 \n", + "4 3.0 0.0 0.0 0.0 \n", + "\n", + " videoID \\\n", + "0 https://www.youtube.com/embed/SsX9Fl958W0 \n", + "1 https://www.youtube.com/embed/-FPY_e0BdJs \n", + "2 https://www.youtube.com/embed/2evIujisdD0 \n", + "3 https://www.youtube.com/embed/jlSbo5wmTPQ \n", + "4 https://www.youtube.com/embed/P5pxJkv71BU \n", + "\n", + " tags \n", + "0 https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg \n", + "1 https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg \n", + "2 https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg \n", + "3 https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg \n", + "4 https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dropna()\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(143, 8)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "df3 = df.tags.str.split(',', expand=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
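<tr><th></th><th>0</th></tr>
<tr><th>0</th><td>True</td></tr>
<tr><th>1</th><td>True</td></tr>
<tr><th>2</th><td>True</td></tr>
<tr><th>3</th><td>True</td></tr>
<tr><th>4</th><td>True</td></tr>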
\n", + "
" + ], + "text/plain": [ + " 0\n", + "0 True\n", + "1 True\n", + "2 True\n", + "3 True\n", + "4 True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2 = df.tags.str.split(',', expand=True).notna()\n", + "df2.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "RangeIndex(start=0, stop=1, step=1)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "columns = df2.columns\n", + "columns" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
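<tr><th></th><th>0</th></tr>
<tr><th>0</th><td>True</td></tr>
<tr><th>1</th><td>True</td></tr>
<tr><th>2</th><td>True</td></tr>
<tr><th>3</th><td>True</td></tr>
<tr><th>4</th><td>True</td></tr>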
\n", + "
" + ], + "text/plain": [ + " 0\n", + "0 True\n", + "1 True\n", + "2 True\n", + "3 True\n", + "4 True" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
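<tr><th></th><th>0</th></tr>
<tr><th>0</th><td>https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg</td></tr>
<tr><th>1</th><td>https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg</td></tr>
<tr><th>2</th><td>https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg</td></tr>
<tr><th>3</th><td>https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg</td></tr>
<tr><th>4</th><td>https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg</td></tr>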
\n", + "
" + ], + "text/plain": [ + " 0\n", + "0 https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg\n", + "1 https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg\n", + "2 https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg\n", + "3 https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg\n", + "4 https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df3.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ssssssssssssssssssssssssssssssssss0ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg\n", + "Name: 0, dtype: object\n", + "ssssssssssssssssssssssssssssssssss1ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg\n", + "Name: 1, dtype: object\n", + "ssssssssssssssssssssssssssssssssss2ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg\n", + "Name: 2, dtype: object\n", + "ssssssssssssssssssssssssssssssssss3ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg\n", + "Name: 3, dtype: object\n", + "ssssssssssssssssssssssssssssssssss4ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg\n", + "Name: 4, dtype: object\n", + "ssssssssssssssssssssssssssssssssss5ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Ni2SjEuz__g/hqdefault.jpg\n", + "Name: 5, dtype: object\n", + "ssssssssssssssssssssssssssssssssss6ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/EXxJ-We2ygw/hqdefault.jpg\n", + "Name: 6, dtype: object\n", + "ssssssssssssssssssssssssssssssssss7ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/tfU8pDNYlDA/hqdefault.jpg\n", + "Name: 7, dtype: object\n", + "ssssssssssssssssssssssssssssssssss8ssssssssssssssssssssssssssssssssss\n", 
+ "0 https://i.ytimg.com/vi/nW5ltiwV-6Y/hqdefault.jpg\n", + "Name: 8, dtype: object\n", + "ssssssssssssssssssssssssssssssssss9ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Z1vISDOhC0k/hqdefault.jpg\n", + "Name: 9, dtype: object\n", + "ssssssssssssssssssssssssssssssssss10ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/lx7KFd6BPcg/hqdefault.jpg\n", + "Name: 10, dtype: object\n", + "ssssssssssssssssssssssssssssssssss11ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/3g6KG_8zq0E/hqdefault.jpg\n", + "Name: 11, dtype: object\n", + "ssssssssssssssssssssssssssssssssss12ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/-NVFQ_q3eRM/hqdefault.jpg\n", + "Name: 12, dtype: object\n", + "ssssssssssssssssssssssssssssssssss13ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/CA6lyOmfRbM/hqdefault.jpg\n", + "Name: 13, dtype: object\n", + "ssssssssssssssssssssssssssssssssss14ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/PIAzK1rvqIY/hqdefault.jpg\n", + "Name: 14, dtype: object\n", + "ssssssssssssssssssssssssssssssssss15ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/nrF_Rgh88no/hqdefault.jpg\n", + "Name: 15, dtype: object\n", + "ssssssssssssssssssssssssssssssssss16ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/4ixLp8aFomw/hqdefault.jpg\n", + "Name: 16, dtype: object\n", + "ssssssssssssssssssssssssssssssssss17ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/UvCO5gKQqtE/hqdefault.jpg\n", + "Name: 17, dtype: object\n", + "ssssssssssssssssssssssssssssssssss18ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/j80mqdfy8Fw/hqdefault.jpg\n", + "Name: 18, dtype: object\n", + "ssssssssssssssssssssssssssssssssss19ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/bKBpDywKje8/hqdefault.jpg\n", + "Name: 19, dtype: object\n", + 
"ssssssssssssssssssssssssssssssssss20ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 20, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss21ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/t_DI7NbjcFs/hqdefault.jpg\n", + "Name: 21, dtype: object\n", + "ssssssssssssssssssssssssssssssssss22ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Ol3Dwucax9U/hqdefault.jpg\n", + "Name: 22, dtype: object\n", + "ssssssssssssssssssssssssssssssssss23ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/NbvHU_KoD74/hqdefault.jpg\n", + "Name: 23, dtype: object\n", + "ssssssssssssssssssssssssssssssssss24ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/zVQJQxpedm8/hqdefault.jpg\n", + "Name: 24, dtype: object\n", + "ssssssssssssssssssssssssssssssssss25ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/lCcE-0bykRU/hqdefault.jpg\n", + "Name: 25, dtype: object\n", + "ssssssssssssssssssssssssssssssssss26ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/seLcRCulwl4/hqdefault.jpg\n", + "Name: 26, dtype: object\n", + "ssssssssssssssssssssssssssssssssss27ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/ZfemCpfJNfU/hqdefault.jpg\n", + "Name: 27, dtype: object\n", + "ssssssssssssssssssssssssssssssssss28ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/TgO-AkopLo4/hqdefault.jpg\n", + "Name: 28, dtype: object\n", + "ssssssssssssssssssssssssssssssssss29ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/HMB4zrP_-HY/hqdefault.jpg\n", + "Name: 29, dtype: object\n", + "ssssssssssssssssssssssssssssssssss30ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/JBm8iptLnuA/hqdefault.jpg\n", + "Name: 30, dtype: object\n", + "ssssssssssssssssssssssssssssssssss31ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Ynp0xyBgwt0/hqdefault.jpg\n", + "Name: 31, dtype: object\n", + 
"ssssssssssssssssssssssssssssssssss32ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/ftGiBv3LL_A/hqdefault.jpg\n", + "Name: 32, dtype: object\n", + "ssssssssssssssssssssssssssssssssss33ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/5pbRivDYzko/hqdefault.jpg\n", + "Name: 33, dtype: object\n", + "ssssssssssssssssssssssssssssssssss34ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/3jlXIX5Ctyo/hqdefault.jpg\n", + "Name: 34, dtype: object\n", + "ssssssssssssssssssssssssssssssssss35ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/mG9OnH9R5yM/hqdefault.jpg\n", + "Name: 35, dtype: object\n", + "ssssssssssssssssssssssssssssssssss36ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/SnMXqyLqZwM/hqdefault.jpg\n", + "Name: 36, dtype: object\n", + "ssssssssssssssssssssssssssssssssss37ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/30ndwJm1I5c/hqdefault.jpg\n", + "Name: 37, dtype: object\n", + "ssssssssssssssssssssssssssssssssss38ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/IoeYrz-fP2o/hqdefault.jpg\n", + "Name: 38, dtype: object\n", + "ssssssssssssssssssssssssssssssssss39ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 39, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss40ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/hJMH_1o8eU0/hqdefault.jpg\n", + "Name: 40, dtype: object\n", + "ssssssssssssssssssssssssssssssssss41ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/OXA_ZD1gR6A/hqdefault.jpg\n", + "Name: 41, dtype: object\n", + "ssssssssssssssssssssssssssssssssss42ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/duOHHDqI40c/hqdefault.jpg\n", + "Name: 42, dtype: object\n", + "ssssssssssssssssssssssssssssssssss43ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/vbHFIALhSWE/hqdefault.jpg\n", + "Name: 43, dtype: object\n", + 
"ssssssssssssssssssssssssssssssssss44ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/ZWytZoEVpGU/hqdefault.jpg\n", + "Name: 44, dtype: object\n", + "ssssssssssssssssssssssssssssssssss45ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/uoAV7651Op0/hqdefault.jpg\n", + "Name: 45, dtype: object\n", + "ssssssssssssssssssssssssssssssssss46ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/702lkQbZx50/hqdefault.jpg\n", + "Name: 46, dtype: object\n", + "ssssssssssssssssssssssssssssssssss47ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/7sgDvC4k6Xg/hqdefault.jpg\n", + "Name: 47, dtype: object\n", + "ssssssssssssssssssssssssssssssssss48ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/cCoGsFVPVh0/hqdefault.jpg\n", + "Name: 48, dtype: object\n", + "ssssssssssssssssssssssssssssssssss49ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 49, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss50ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Odog86JslbA/hqdefault.jpg\n", + "Name: 50, dtype: object\n", + "ssssssssssssssssssssssssssssssssss51ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/SZO8jF9Z6vw/hqdefault.jpg\n", + "Name: 51, dtype: object\n", + "ssssssssssssssssssssssssssssssssss52ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/dAKyi8aFq3Y/hqdefault.jpg\n", + "Name: 52, dtype: object\n", + "ssssssssssssssssssssssssssssssssss53ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/GskbfPKP35E/hqdefault.jpg\n", + "Name: 53, dtype: object\n", + "ssssssssssssssssssssssssssssssssss54ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/sVxLiftJGbU/hqdefault.jpg\n", + "Name: 54, dtype: object\n", + "ssssssssssssssssssssssssssssssssss55ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/0k0fvqikaoE/hqdefault.jpg\n", + "Name: 55, dtype: object\n", + 
"ssssssssssssssssssssssssssssssssss56ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/x8OCVDCDrDA/hqdefault.jpg\n", + "Name: 56, dtype: object\n", + "ssssssssssssssssssssssssssssssssss57ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/yl3kavXxvHo/hqdefault.jpg\n", + "Name: 57, dtype: object\n", + "ssssssssssssssssssssssssssssssssss58ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Ihbu0aZwkE8/hqdefault.jpg\n", + "Name: 58, dtype: object\n", + "ssssssssssssssssssssssssssssssssss59ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/13viBxojGvA/hqdefault.jpg\n", + "Name: 59, dtype: object\n", + "ssssssssssssssssssssssssssssssssss60ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/DmSephyJNtQ/hqdefault.jpg\n", + "Name: 60, dtype: object\n", + "ssssssssssssssssssssssssssssssssss61ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/30pPGx0J6FU/hqdefault.jpg\n", + "Name: 61, dtype: object\n", + "ssssssssssssssssssssssssssssssssss62ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 62, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss63ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/eIRhXharV7k/hqdefault.jpg\n", + "Name: 63, dtype: object\n", + "ssssssssssssssssssssssssssssssssss64ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/2waSmpD1zQg/hqdefault.jpg\n", + "Name: 64, dtype: object\n", + "ssssssssssssssssssssssssssssssssss65ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 65, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss66ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/P4LonC3puS4/hqdefault.jpg\n", + "Name: 66, dtype: object\n", + "ssssssssssssssssssssssssssssssssss67ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/oJdubyyJNIQ/hqdefault.jpg\n", + "Name: 67, dtype: object\n", + "ssssssssssssssssssssssssssssssssss68ssssssssssssssssssssssssssssssssss\n", + "0 
https://i.ytimg.com/vi/UcvCdFfI3bs/hqdefault.jpg\n", + "Name: 68, dtype: object\n", + "ssssssssssssssssssssssssssssssssss69ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/_fNZLrz97kg/hqdefault.jpg\n", + "Name: 69, dtype: object\n", + "ssssssssssssssssssssssssssssssssss70ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/1tCbvYv_ibw/hqdefault.jpg\n", + "Name: 70, dtype: object\n", + "ssssssssssssssssssssssssssssssssss71ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/EZ-im7m8630/hqdefault.jpg\n", + "Name: 71, dtype: object\n", + "ssssssssssssssssssssssssssssssssss72ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/03ahRfkfwME/hqdefault.jpg\n", + "Name: 72, dtype: object\n", + "ssssssssssssssssssssssssssssssssss73ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/h27uLjDOK-M/hqdefault.jpg\n", + "Name: 73, dtype: object\n", + "ssssssssssssssssssssssssssssssssss74ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/8OoLg39nNlo/hqdefault.jpg\n", + "Name: 74, dtype: object\n", + "ssssssssssssssssssssssssssssssssss75ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/DJd0JYaVkqA/hqdefault.jpg\n", + "Name: 75, dtype: object\n", + "ssssssssssssssssssssssssssssssssss76ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/hUXGQwTSfMs/hqdefault.jpg\n", + "Name: 76, dtype: object\n", + "ssssssssssssssssssssssssssssssssss77ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/-zcJ4uB7XUo/hqdefault.jpg\n", + "Name: 77, dtype: object\n", + "ssssssssssssssssssssssssssssssssss78ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/tQ_9a6UhUQs/hqdefault.jpg\n", + "Name: 78, dtype: object\n", + "ssssssssssssssssssssssssssssssssss79ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/ztwsGeT5lR0/hqdefault.jpg\n", + "Name: 79, dtype: object\n", + "ssssssssssssssssssssssssssssssssss80ssssssssssssssssssssssssssssssssss\n", 
+ "0 https://i.ytimg.com/vi/nOlH-P8-5PI/hqdefault.jpg\n", + "Name: 80, dtype: object\n", + "ssssssssssssssssssssssssssssssssss81ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/BdppFIT_lIs/hqdefault.jpg\n", + "Name: 81, dtype: object\n", + "ssssssssssssssssssssssssssssssssss82ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/7nYkJctgSSA/hqdefault.jpg\n", + "Name: 82, dtype: object\n", + "ssssssssssssssssssssssssssssssssss83ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/hZHfdOKFlAw/hqdefault.jpg\n", + "Name: 83, dtype: object\n", + "ssssssssssssssssssssssssssssssssss84ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/gYTJrTXaGwA/hqdefault.jpg\n", + "Name: 84, dtype: object\n", + "ssssssssssssssssssssssssssssssssss85ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/cFTB5EJUxzw/hqdefault.jpg\n", + "Name: 85, dtype: object\n", + "ssssssssssssssssssssssssssssssssss86ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/T8EfomTlcfA/hqdefault.jpg\n", + "Name: 86, dtype: object\n", + "ssssssssssssssssssssssssssssssssss87ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/ww8dRu4_1EY/hqdefault.jpg\n", + "Name: 87, dtype: object\n", + "ssssssssssssssssssssssssssssssssss88ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Bb896qn7S54/hqdefault.jpg\n", + "Name: 88, dtype: object\n", + "ssssssssssssssssssssssssssssssssss89ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/WgnmQk_2yF4/hqdefault.jpg\n", + "Name: 89, dtype: object\n", + "ssssssssssssssssssssssssssssssssss90ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/mtp0Mu-yj_o/hqdefault.jpg\n", + "Name: 90, dtype: object\n", + "ssssssssssssssssssssssssssssssssss91ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/mkKDI6y2kyE/hqdefault.jpg\n", + "Name: 91, dtype: object\n", + 
"ssssssssssssssssssssssssssssssssss92ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 92, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss93ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/JToPoYip-C4/hqdefault.jpg\n", + "Name: 93, dtype: object\n", + "ssssssssssssssssssssssssssssssssss94ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/AgRHEGB8Urs/hqdefault.jpg\n", + "Name: 94, dtype: object\n", + "ssssssssssssssssssssssssssssssssss95ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/SRCToEkq7to/hqdefault.jpg\n", + "Name: 95, dtype: object\n", + "ssssssssssssssssssssssssssssssssss96ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/A6EIl677ntQ/hqdefault.jpg\n", + "Name: 96, dtype: object\n", + "ssssssssssssssssssssssssssssssssss97ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/4HD5rCNYxng/hqdefault.jpg\n", + "Name: 97, dtype: object\n", + "ssssssssssssssssssssssssssssssssss98ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 98, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss99ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/hnc3bGtYQsQ/hqdefault.jpg\n", + "Name: 99, dtype: object\n", + "ssssssssssssssssssssssssssssssssss100ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/cva2sxX5PgM/hqdefault.jpg\n", + "Name: 100, dtype: object\n", + "ssssssssssssssssssssssssssssssssss101ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/cDOlBRzHRI0/hqdefault.jpg\n", + "Name: 101, dtype: object\n", + "ssssssssssssssssssssssssssssssssss102ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Mxdze0Wo91U/hqdefault.jpg\n", + "Name: 102, dtype: object\n", + "ssssssssssssssssssssssssssssssssss103ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/YH_rnTjnWfg/hqdefault.jpg\n", + "Name: 103, dtype: object\n", + "ssssssssssssssssssssssssssssssssss104ssssssssssssssssssssssssssssssssss\n", + 
"Series([], Name: 104, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss105ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/WFRBxz6AeZI/hqdefault.jpg\n", + "Name: 105, dtype: object\n", + "ssssssssssssssssssssssssssssssssss106ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/7yuPVq9DtV0/hqdefault.jpg\n", + "Name: 106, dtype: object\n", + "ssssssssssssssssssssssssssssssssss107ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/vYP6GdsEmg0/hqdefault.jpg\n", + "Name: 107, dtype: object\n", + "ssssssssssssssssssssssssssssssssss108ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/7k4GbHQNmQo/hqdefault.jpg\n", + "Name: 108, dtype: object\n", + "ssssssssssssssssssssssssssssssssss109ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/o_CSmob64uU/hqdefault.jpg\n", + "Name: 109, dtype: object\n", + "ssssssssssssssssssssssssssssssssss110ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/o8Je7hPgsdU/hqdefault.jpg\n", + "Name: 110, dtype: object\n", + "ssssssssssssssssssssssssssssssssss111ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/iDFjTrl7J8w/hqdefault.jpg\n", + "Name: 111, dtype: object\n", + "ssssssssssssssssssssssssssssssssss112ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 112, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss113ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/q2CBNLsQbCM/hqdefault.jpg\n", + "Name: 113, dtype: object\n", + "ssssssssssssssssssssssssssssssssss114ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/jEYQqLtK_Xw/hqdefault.jpg\n", + "Name: 114, dtype: object\n", + "ssssssssssssssssssssssssssssssssss115ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/k66FoY5ndfI/hqdefault.jpg\n", + "Name: 115, dtype: object\n", + "ssssssssssssssssssssssssssssssssss116ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/WbW0rHCX2UU/hqdefault.jpg\n", + 
"Name: 116, dtype: object\n", + "ssssssssssssssssssssssssssssssssss117ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/2YoUqR9fuA4/hqdefault.jpg\n", + "Name: 117, dtype: object\n", + "ssssssssssssssssssssssssssssssssss118ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Sr0fZ298eM8/hqdefault.jpg\n", + "Name: 118, dtype: object\n", + "ssssssssssssssssssssssssssssssssss119ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/_umr17a_AdQ/hqdefault.jpg\n", + "Name: 119, dtype: object\n", + "ssssssssssssssssssssssssssssssssss120ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/XQjyjn3MdxM/hqdefault.jpg\n", + "Name: 120, dtype: object\n", + "ssssssssssssssssssssssssssssssssss121ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 121, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss122ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/m3Xf1ra2Ekg/hqdefault.jpg\n", + "Name: 122, dtype: object\n", + "ssssssssssssssssssssssssssssssssss123ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/DYsCJEfQh1U/hqdefault.jpg\n", + "Name: 123, dtype: object\n", + "ssssssssssssssssssssssssssssssssss124ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/PK-GvWWQ03g/hqdefault.jpg\n", + "Name: 124, dtype: object\n", + "ssssssssssssssssssssssssssssssssss125ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/vHab6BNrHU8/hqdefault.jpg\n", + "Name: 125, dtype: object\n", + "ssssssssssssssssssssssssssssssssss126ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/JKfFCVPjo_g/hqdefault.jpg\n", + "Name: 126, dtype: object\n", + "ssssssssssssssssssssssssssssssssss127ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/__d5Q6IF1Sg/hqdefault.jpg\n", + "Name: 127, dtype: object\n", + "ssssssssssssssssssssssssssssssssss128ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/oLBqixxgd6Y/hqdefault.jpg\n", + "Name: 128, 
dtype: object\n", + "ssssssssssssssssssssssssssssssssss129ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/X2bUUkWC7dE/hqdefault.jpg\n", + "Name: 129, dtype: object\n", + "ssssssssssssssssssssssssssssssssss130ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 130, dtype: object)\n", + "ssssssssssssssssssssssssssssssssss131ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/szPjXJeIGP8/hqdefault.jpg\n", + "Name: 131, dtype: object\n", + "ssssssssssssssssssssssssssssssssss132ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/eEHBjP06WSI/hqdefault.jpg\n", + "Name: 132, dtype: object\n", + "ssssssssssssssssssssssssssssssssss133ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/epgHrLszj-Q/hqdefault.jpg\n", + "Name: 133, dtype: object\n", + "ssssssssssssssssssssssssssssssssss134ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/t3ppxtEU6No/hqdefault.jpg\n", + "Name: 134, dtype: object\n", + "ssssssssssssssssssssssssssssssssss135ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/yd62ObxkV44/hqdefault.jpg\n", + "Name: 135, dtype: object\n", + "ssssssssssssssssssssssssssssssssss136ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/AkiC0_09Zss/hqdefault.jpg\n", + "Name: 136, dtype: object\n", + "ssssssssssssssssssssssssssssssssss137ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/Xz5XIHrT4LQ/hqdefault.jpg\n", + "Name: 137, dtype: object\n", + "ssssssssssssssssssssssssssssssssss138ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/_lsDECLUt3k/hqdefault.jpg\n", + "Name: 138, dtype: object\n", + "ssssssssssssssssssssssssssssssssss139ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/iBsg75W2Vig/hqdefault.jpg\n", + "Name: 139, dtype: object\n", + "ssssssssssssssssssssssssssssssssss140ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/sUtkJUJuq2U/hqdefault.jpg\n", + "Name: 140, dtype: object\n", 
+ "ssssssssssssssssssssssssssssssssss141ssssssssssssssssssssssssssssssssss\n", + "0 https://i.ytimg.com/vi/YzhLEjUD8hk/hqdefault.jpg\n", + "Name: 141, dtype: object\n", + "ssssssssssssssssssssssssssssssssss142ssssssssssssssssssssssssssssssssss\n", + "Series([], Name: 142, dtype: object)\n" + ] + } + ], + "source": [ + "for column in df3.T.columns:\n", + " print('ssssssssssssssssssssssssssssssssss' + str(column) + 'ssssssssssssssssssssssssssssssssss')\n", + " print(df3.T[column].dropna())" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
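<tr><th></th><th>title</th><th>Views</th><th>Like</th><th>Dislike</th><th>Favorite</th><th>Comment</th><th>videoID</th><th>tags</th><th>nameurl</th></tr>
<tr><th>0</th><td>PyCharm/IntelliJ fast and auto change of the color theme</td><td>41.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>https://www.youtube.com/embed/SsX9Fl958W0</td><td>https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg</td><td>&lt;pandas.io.formats.style.Styler object at 0x7ff60af976d8&gt;</td></tr>
<tr><th>1</th><td>How to add weather desklet to Linux Mint 19</td><td>291.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/-FPY_e0BdJs</td><td>https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg</td><td>&lt;pandas.io.formats.style.Styler object at 0x7ff60af976d8&gt;</td></tr>
<tr><th>2</th><td>How to easy integrate Google Calendar to Desktop for Linux Mint</td><td>226.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/2evIujisdD0</td><td>https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg</td><td>&lt;pandas.io.formats.style.Styler object at 0x7ff60af976d8&gt;</td></tr>
<tr><th>3</th><td>Pandas use a list of values to select rows from a column</td><td>45.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>10.0</td><td>https://www.youtube.com/embed/jlSbo5wmTPQ</td><td>https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg</td><td>&lt;pandas.io.formats.style.Styler object at 0x7ff60af976d8&gt;</td></tr>
<tr><th>4</th><td>Pandas count and percentage by value for a column</td><td>63.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/P5pxJkv71BU</td><td>https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg</td><td>&lt;pandas.io.formats.style.Styler object at 0x7ff60af976d8&gt;</td></tr>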
\n", + "
" + ], + "text/plain": [ + " title Views \\\n", + "0 PyCharm/IntelliJ fast and auto change of the color theme 41.0 \n", + "1 How to add weather desklet to Linux Mint 19 291.0 \n", + "2 How to easy integrate Google Calendar to Desktop for Linux Mint 226.0 \n", + "3 Pandas use a list of values to select rows from a column 45.0 \n", + "4 Pandas count and percentage by value for a column 63.0 \n", + "\n", + " Like Dislike Favorite Comment \\\n", + "0 0.0 0.0 0.0 2.0 \n", + "1 0.0 0.0 0.0 0.0 \n", + "2 1.0 0.0 0.0 0.0 \n", + "3 3.0 0.0 0.0 10.0 \n", + "4 3.0 0.0 0.0 0.0 \n", + "\n", + " videoID \\\n", + "0 https://www.youtube.com/embed/SsX9Fl958W0 \n", + "1 https://www.youtube.com/embed/-FPY_e0BdJs \n", + "2 https://www.youtube.com/embed/2evIujisdD0 \n", + "3 https://www.youtube.com/embed/jlSbo5wmTPQ \n", + "4 https://www.youtube.com/embed/P5pxJkv71BU \n", + "\n", + " tags \\\n", + "0 https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg \n", + "1 https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg \n", + "2 https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg \n", + "3 https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg \n", + "4 https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg \n", + "\n", + " nameurl \n", + "0 \n", + "1 \n", + "2 \n", + "3 \n", + "4 " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def make_clickable(val):\n", + " # target _blank to open new window\n", + " return '{}'.format(val, val)\n", + "\n", + "df['nameurl'] = df.style.format({'videoID': make_clickable})\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
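<tr><th></th><th>title</th><th>Views</th><th>Like</th><th>Dislike</th><th>Favorite</th><th>Comment</th><th>videoID</th><th>tags</th><th>nameurl</th></tr>
<tr><th>0</th><td>PyCharm/IntelliJ fast and auto change of the color theme</td><td>41.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>https://www.youtube.com/embed/SsX9Fl958W0</td><td>https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg</td><td>XXXXX</td></tr>
<tr><th>1</th><td>How to add weather desklet to Linux Mint 19</td><td>291.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/-FPY_e0BdJs</td><td>https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg</td><td>XXXXX</td></tr>
<tr><th>2</th><td>How to easy integrate Google Calendar to Desktop for Linux Mint</td><td>226.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/2evIujisdD0</td><td>https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg</td><td>XXXXX</td></tr>
<tr><th>3</th><td>Pandas use a list of values to select rows from a column</td><td>45.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>10.0</td><td>https://www.youtube.com/embed/jlSbo5wmTPQ</td><td>https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg</td><td>XXXXX</td></tr>
<tr><th>4</th><td>Pandas count and percentage by value for a column</td><td>63.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>https://www.youtube.com/embed/P5pxJkv71BU</td><td>https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg</td><td>XXXXX</td></tr>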
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import HTML\n", + "\n", + "df['nameurl'] = df['videoID'].apply(lambda x: '<a href=\"{}\">XXXXX</a>'.format(x))\n", + "HTML(df.head().to_html(escape=False))" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | title | Views | Like | Dislike | Favorite | Comment | videoID | tags | nameurl |
|---|---|---|---|---|---|---|---|---|---|
| 91 | No Python Interpreter Configured For The Module - PyCharm/IntelliJ | 11,367.0 | 27.0 | 20.0 | 0.0 | 8.0 | https://www.youtube.com/embed/mkKDI6y2kyE | https://i.ytimg.com/vi/mkKDI6y2kyE/hqdefault.jpg | XXXXX |
| 124 | python extract text from image or pdf | 6,229.0 | 16.0 | 29.0 | 0.0 | 11.0 | https://www.youtube.com/embed/PK-GvWWQ03g | https://i.ytimg.com/vi/PK-GvWWQ03g/hqdefault.jpg | XXXXX |
| 23 | apex legends game requires directx 11 feature video card | 5,690.0 | 36.0 | 10.0 | 0.0 | 9.0 | https://www.youtube.com/embed/NbvHU_KoD74 | https://i.ytimg.com/vi/NbvHU_KoD74/hqdefault.jpg | XXXXX |
| 46 | Extract tabular data from PDF with Python - Tabula, Camelot, PyPDF2 | 5,397.0 | 62.0 | 2.0 | 0.0 | 26.0 | https://www.youtube.com/embed/702lkQbZx50 | https://i.ytimg.com/vi/702lkQbZx50/hqdefault.jpg | XXXXX |
| 134 | ubuntu 16 04 server install headless google chrome | 4,468.0 | 24.0 | 6.0 | 0.0 | 5.0 | https://www.youtube.com/embed/t3ppxtEU6No | https://i.ytimg.com/vi/t3ppxtEU6No/hqdefault.jpg | XXXXX |
| 125 | mysql 5 7 vs mysql 8 do you need to upgrade to mysql 8 | 4,391.0 | 12.0 | 18.0 | 0.0 | 9.0 | https://www.youtube.com/embed/vHab6BNrHU8 | https://i.ytimg.com/vi/vHab6BNrHU8/hqdefault.jpg | XXXXX |
| 116 | Python read validate and import CSV JSON file to MySQL | 3,513.0 | 12.0 | 1.0 | 0.0 | 6.0 | https://www.youtube.com/embed/WbW0rHCX2UU | https://i.ytimg.com/vi/WbW0rHCX2UU/hqdefault.jpg | XXXXX |
| 68 | How to add annotations in new Youtube studio | 3,495.0 | 21.0 | 24.0 | 0.0 | 6.0 | https://www.youtube.com/embed/UcvCdFfI3bs | https://i.ytimg.com/vi/UcvCdFfI3bs/hqdefault.jpg | XXXXX |
| 32 | Apex Legends MSVCP140.dll Is Missing Fix, MSVCP120.dll Is Missing, not starting | 2,358.0 | 14.0 | 2.0 | 0.0 | 11.0 | https://www.youtube.com/embed/ftGiBv3LL_A | https://i.ytimg.com/vi/ftGiBv3LL_A/hqdefault.jpg | XXXXX |
| 13 | Install latest NVIDIA drivers for Linux Mint 19/Ubuntu 18.04 | 1,728.0 | 13.0 | 0.0 | 0.0 | 6.0 | https://www.youtube.com/embed/CA6lyOmfRbM | https://i.ytimg.com/vi/CA6lyOmfRbM/hqdefault.jpg | XXXXX |
| 80 | Simple ways to create shortcut in Linux Mint 19 | 1,652.0 | 9.0 | 2.0 | 0.0 | 6.0 | https://www.youtube.com/embed/nOlH-P8-5PI | https://i.ytimg.com/vi/nOlH-P8-5PI/hqdefault.jpg | XXXXX |
| 52 | linux mint disable login keyring | 1,592.0 | 8.0 | 1.0 | 0.0 | 11.0 | https://www.youtube.com/embed/dAKyi8aFq3Y | https://i.ytimg.com/vi/dAKyi8aFq3Y/hqdefault.jpg | XXXXX |
| 81 | The simplest way to run python headless test with Chrome on Ubuntu | 1,077.0 | 8.0 | 0.0 | 0.0 | 2.0 | https://www.youtube.com/embed/BdppFIT_lIs | https://i.ytimg.com/vi/BdppFIT_lIs/hqdefault.jpg | XXXXX |
| 122 | java benchmarks examples | 922.0 | 5.0 | 3.0 | 0.0 | 0.0 | https://www.youtube.com/embed/m3Xf1ra2Ekg | https://i.ytimg.com/vi/m3Xf1ra2Ekg/hqdefault.jpg | XXXXX |
| 76 | Easy way to convert dictionary to SQL insert with Python | 864.0 | 3.0 | 0.0 | 0.0 | 0.0 | https://www.youtube.com/embed/hUXGQwTSfMs | https://i.ytimg.com/vi/hUXGQwTSfMs/hqdefault.jpg | XXXXX |
| 14 | Linux Mint identify, fix sound problems, set default device | 859.0 | 4.0 | 0.0 | 0.0 | 1.0 | https://www.youtube.com/embed/PIAzK1rvqIY | https://i.ytimg.com/vi/PIAzK1rvqIY/hqdefault.jpg | XXXXX |
| 71 | python performance profiling in pycharm | 825.0 | 0.0 | 3.0 | 0.0 | 0.0 | https://www.youtube.com/embed/EZ-im7m8630 | https://i.ytimg.com/vi/EZ-im7m8630/hqdefault.jpg | XXXXX |
| 70 | Python Cumulative Sum per Group with Pandas | 801.0 | 5.0 | 0.0 | 0.0 | 1.0 | https://www.youtube.com/embed/1tCbvYv_ibw | https://i.ytimg.com/vi/1tCbvYv_ibw/hqdefault.jpg | XXXXX |
| 50 | Linux Mint 19 How to change user password | 735.0 | 6.0 | 0.0 | 0.0 | 2.0 | https://www.youtube.com/embed/Odog86JslbA | https://i.ytimg.com/vi/Odog86JslbA/hqdefault.jpg | XXXXX |
| 21 | play fortnite linux virtual machine | 532.0 | 2.0 | 1.0 | 0.0 | 1.0 | https://www.youtube.com/embed/t_DI7NbjcFs | https://i.ytimg.com/vi/t_DI7NbjcFs/hqdefault.jpg | XXXXX |
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df2 = df.sort_values(by=['Views'], ascending=False).head(20)\n", + "\n", + "HTML(df2.to_html(escape=False))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAD8CAYAAACcjGjIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAIABJREFUeJzt3Xd8FHX+x/HXZ5OQAKGFEmlCKIIkAUIKQQQVqaKAHl3pRcTueaeeBT31Tu9QsSBKExAEFAX5nQoCoiLSQq9CKEKogVATSNvv74+dhIWlp0w2+Twfj31k9zvfmflMJtn3TtkZMcaglFJKuXPYXYBSSqmCR8NBKaWUBw0HpZRSHjQclFJKedBwUEop5UHDQSmllAcNB6WUUh40HJRSSnnQcFBKKeXB1+4CblSFChVMzZo17S5DKaW8xurVq48aYypeS1+vDYeaNWsSFxdndxlKKeU1ROTPa+2ru5WUUkp50HBQSinlQcNBKaWUB6895nAp6enpJCQkcO7cObtL8XoBAQFUq1YNPz8/u0tRStmgUIVDQkICpUqVombNmoiI3eV4LWMMx44dIyEhgZCQELvLUUrZoFDtVjp37hzly5fXYMghEaF8+fK6BaZUEVaowgHQYMgl+ntUqmgrdOGglIL4I2f4enUCehtgdaM0HHLRXXfdxfz58y9oGzVqFAMGDKBr1642VaWKmsV/HKHL6KX89av1vDp3swaEuiEaDrmoV69ezJgx44K2GTNmMGDAAGbNmmVTVaqoMMbw2dLdDJq0iupBJXgo9mYmL/uTl7/dhNOpAaGuj4ZDLuratSvfffcdaWlpAOzZs4cDBw5QvXp1wsLCAMjMzORvf/sb0dHRNGzYkE8//RSARx99lLlz5wJw//33M3DgQAAmTpzIiy++SHJyMh07dqRRo0aEhYUxc+ZMG5ZQFVTpmU5emrOJ1/5vC3ffGsysYc14vXMYw+6ozdTle3lxjgaEuj6F6lRWd6/932a2HDiVq9NsUKU0I+4LvezwoKAgYmJi+OGHH+jcuTMzZsyge/fuFxzcnTBhAmXKlGHVqlWkpqbSvHlz2rZtS4sWLViyZAmdOnVi//79HDx4EIAlS5bQs2dP5s2bR5UqVfjuu+8AOHnyZK4um/JeJ1PSefSLNfwWf5SH76jFc+3q43C4/uaea18PHweMXrwTYwz/uj88e5hSV6JbDrnMfdfSjBkz6NWr1wXDf/zxR6ZMmULjxo1p2rQpx44dY8eOHdnhsGXLFho0aEBwcDAHDx5k2bJl3HbbbYSHh7NgwQKee+45lixZQpkyZexYPFXA7DmazP1jlrJi9zH+07UhL3S49YI3fxHh2bb1eKJVHWas2sffv95Apm5BqGtQaLccrvQJPy917tyZp59+mjVr1pCSkkJkZCR79uzJHm6M4cMPP6Rdu3Ye4544cYJ58+bRsmVLkpKS+PLLLwkMDKRUqVKUKlWKNWvW8
P333/PSSy9x991388orr+TjkqmCZvmuYwybuhqAzwc1JbZW+Uv2ExGeaVsPh0MYtXAHTqfhv90a4aNbEOoKCm042CUwMJC77rqLgQMHemw1ALRr144xY8bQqlUr/Pz82L59O1WrVqVkyZLExsYyatQofvrpJ44dO0bXrl2zz3I6cOAAQUFBPPTQQ5QtW5bx48fn96KpAuTLVft4cc5Gbg4qwYR+0dSsUPKq4zzV+hZ8RHhnwXacxjCyWyN8fXTngbo0DYc80KtXL+6//36PM5cABg8ezJ49e2jSpAnGGCpWrMicOXMAaNGiBT/++CN16tShRo0aJCUl0aJFCwA2btzI3/72NxwOB35+fowZMyZfl0kVDJlOw9vztjH21120qFuBj3o3oUzxa7/+1eN318XhEP47/w8yDbzXXQNCXZp46znQUVFR5uKb/WzdupVbb73VpooKH/19FizJqRk8OWMdC7ce5qHYmxlxXyh+N/jG/skvO3nrh210DK/MqJ6Nb3g6yruIyGpjTNS19NUtB6W8wIETZxk0OY4/Dp3i1fsa0O+2nF1cctgdtfF1CG98txWnMXzQK0IDQl1A/xqUKuDW7TtB59FL2ZeUwoT+0fRvHpIr174a3KIWL9/bgB82HeLRaWtIy3DmQrWqsNBwUKoA+7/1B+jx6TL8fR18M/w27qpXKVenP+j2EF69rwE/bjnM8GmrSc3IzNXpK++l4aBUAWSM4f2FO3h8+lrCq5bh20ebc0twqTyZV//mIbzeOZSFW4/wyNQ1nEvXgFAaDkoVOOfSM3lyxjreW7idByKqMm1IU8oH+ufpPPs0q8m/7g/np21HePjz1RoQSsNBqYIk8XQqvcYtZ+76A/ytXT3e6d4If1+ffJl376Y389YD4fy6I5EhU+I0IIo4DYdcFhgY6NH2ySefMGXKFADuvPNOLj4FVymArQdP0WX0UrYePMWYB5vw6F118v2mSz1jbubtvzTkt/ijDJ4cx9k0DYiiSk9lzQfDhg2zuwRVwC3aepgnpq8lMMCXrx6+jfBq9l07q3tUdXxEeHbWegZOWsWE/lGUKKZvFUWNbjnkg1dffZWRI0de0OZ0Ounfvz8vvfQS4LogX7NmzWjSpAndunXjzJkzdpSq8pkxhvFLdjF4ShwhFUvy7aO32xoMWf4SWY33ujdmxe5jDPhsFcmpGXaXpPJZ4f048MPzcGhj7k7zpnDo8FaOJ5ORkcGDDz5IWFgYL774IkePHuWNN95g4cKFlCxZkrfffpt3331XL6xXyKVnOnnl201MX7mP9qE38W6PRgXqE3qXiKqIwNMz1zHgs1VMHBBNoH/BqU/lLV3TNnj44Yfp3r07L774IgDLly9ny5YtNG/eHIC0tDSaNWtmZ4kqj51ISeORqWtYtusYw++szbPWVVMLms6Nq+LjEJ6csY7+E1fy2YBoSgVc+7WclPe6ajiIyETgXuCIMSbMagsCZgI1gT1Ad2PMcXEdPXsfuAdIAfobY9ZY4/QDXrIm+4YxZrLVHglMAooD3wNPmty44FMufMLPK7fddhuLFy/mr3/9KwEBARhjaNOmDdOnT7e7NJUPdiWeYdDkOBKOp/BOt0b8JbKa3SVd0b0Nq+AjwuPT19J34komD4yhtAZEoXctxxwmAe0vanseWGSMqQsssl4DdADqWo+hwBjIDpMRQFMgBhghIuWsccYAQ9zGu3hehc6gQYO455576N69OxkZGcTGxrJ06VLi4+MBSE5OZvv27TZXqfLC7/FHuf/j3zl5Np0vhsQW+GDI0iG8Mh/1bsLGhJP0mbCSk2fT7S5J5bGrhoMx5lcg6aLmzsBk6/lkoItb+xTjshwoKyKVgXbAAmNMkjHmOLAAaG8NK22MWW5tLUxxm5ZXSklJoVq1atmPd99995L9nnnmGSIiIujTpw/ly5dn0qRJ9OrVi4YNG9KsWTO2bduWz5WrvDZ95V76TlxJpVL+zBnenOiaQXaXdF3ah93EmIci2
XLgJH0mrOBkigZEYXajxxyCjTEHreeHgGDreVVgn1u/BKvtSu0Jl2j3Wk7nlS9e9vPPP2c/f+2117Kft2rVilWrVuVVWcpGmU7Dv77fyoTfdtPylop81DvCa3fLtGkQzCcPRfLI1DU8OGE5Uwc1pWyJYnaXpfJAjk9ltT7x58tNIURkqIjEiUhcYmJifsxSqRw5k5rBkClxTPhtN/1vq8nEflFeGwxZ7r41mE/7RLL98Bl6j1vB8eQ0u0tSeeBGw+GwtUsI6+cRq30/UN2tXzWr7Urt1S7RfknGmLHGmChjTFTFihVvsHSl8kfC8RS6jvmdX7Yn8nrnUF7tFFpo7rp2V/1KjOsbxc7EM/Qat5xjZ1LtLknlshv9S50L9LOe9wO+dWvvKy6xwElr99N8oK2IlLMORLcF5lvDTolIrHWmU1+3aSnltdbsPU6X0UvZf+IskwZE06dZTbtLynV33FKRCf2i2X00md7jVnBUA6JQuWo4iMh0YBlQT0QSRGQQ8BbQRkR2AK2t1+A6FXUXEA+MA4YDGGOSgNeBVdbjn1YbVp/x1jg7gR9yZ9GUsse36/bTc+xyShTzZfbw22hRt/Bu5d5etwKf9Y/mz6Rkeo1dTuJpDYjCQu8hrS5Lf5/Xx+k0jFq4nQ9+iiemZhCf9IkkqGTROFi7bOcxBk5aRZWyAUwfEkul0gF2l6Qu4XruIV04doAqZbNz6Zk8PmMtH/wUT9fIanw+OKbIBANAs9rlmTwwhoMnz9Fz7HIOnzpnd0kqhzQccpmPjw+NGzcmNDSURo0a8c4772Sf3hoXF8cTTzxx2XH37NlDWFiYR99LXbhPFRxHTp2jx9jlfL/xIM93qM9/uzbMt3swFCQxIUFMGRjD4VOugDh0UgPCm+m1lXJZ8eLFWbduHQBHjhyhd+/enDp1itdee42oqCiioq5pi+66+ir7bD5wksGT4ziRks4nD0XSLvQmu0uyVVTNIKYMiqHfxFX0GLuM6UNiqVK2uN1lqRugWw55qFKlSowdO5aPPvoIYww///wz9957LwC//PILjRs3pnHjxkRERHD69OkLxnXv627cuHF06NCBs2fPsnPnTtq3b09kZCQtWrTQb1Xnsx83H6LbJ8swBr4a1qzIB0OWyBpBfD4ohqQzafQYu4yE4yl2l6RuQKHdcnh75dtsS8rdN8v6QfV5Lua56xqnVq1aZGZmcuTIkQvaR44cyejRo2nevDlnzpwhIODqB/A++ugjFixYwJw5c/D392fo0KF88skn1K1blxUrVjB8+HB++umn66pPXT9jDGN/3cVb87YRXrUM4/pGEawHYC8QcXM5pg5uSp8JK+g5djnTh8RSPaiE3WWp61Bow6Gga968Oc888wwPPvggDzzwANWqXfkCbFOmTKF69erMmTMHPz8/zpw5w++//063bt2y+6Sm6mmEeS0tw8lLczbyZVwCHcMrM7JbI4oXK3rHF65Fo+plmTY4lofcAuLm8hoQ3qLQhsP1fsLPK7t27cLHx4dKlSqxdevW7Pbnn3+ejh078v3339O8eXPmz59/xa2H8PBw1q1bR0JCAiEhITidTsqWLZt9fEPlvePJaQybupoVu5N4olUdnmp9S4G8B0NBEl6tDNMGN7UCYhlfDImlZoWSdpelroEec8hDiYmJDBs2jMcee8zjRvE7d+4kPDyc5557jujo6KseL4iIiODTTz+lU6dOHDhwgNKlSxMSEsJXX30FuHZ1rF+/Ps+WpaiLP3KGLh8vZe3eE4zq0ZhnCujNeQqisKpl+GJwLGfTM+k5djm7jybbXZK6BhoOuezs2bPZp7K2bt2atm3bMmLECI9+o0aNIiwsjIYNG+Ln50eHDh2uOu3bb7+dkSNH0rFjR44ePcq0adOYMGECjRo1IjQ0lG+/1SuP5IXfdhzl/o+XkpyawfShsXSJ8OoLB9uiQZXSTB8aS3qmkx6fLmNnot4jvaDTb0iry9LfJ0xd/icj5m6mTsVAxveL0
oOqObT98Gl6j1uOiDB9SFPqVCpld0lFin5DWqkcysh08urczbw0ZxMt61Zg1iPNNBhywS3BpZgxNBaAnmNXsP3w6auMoeyi4aDURU6dS2fQ5Dgm/b6Hgc1DGN8vmlJefg+GgqROJVdAOAR6jV3OtkOn7C5JXYKGg1Ju9iW57sGwNP4ob94fxiv3NcBHDzznutoVA5kxNBZfH6H3uBVsOaABUdBoOChliduTROfRSzl08hyTB8bwYNMadpdUqNWqGMjMoc3w93XQe/xyNh84aXdJyo2Gg1LAN2sS6D1uBaUDfJn9aHOa16lgd0lFQs0KJZk5tBkli/nSe9wKNu3XgCgoNBxUkeZ0Gv47fxvPfLmeJjXKMnt4c2pXDLS7rCLl5vIlmDE0lkB/X3qPW876fSfsLkmh4ZDrsi7ZnfXYs2dPns3rwIEDdO3aFYB169bx/fff59m8CqOzaZk8+sUaRi/eSc/o6kwZ2JRyRegeDAVJ9aASzHw4ljIl/HhowgrW7j1ud0lFnoZDLsu6ZHfWo2bNmnkyn4yMDKpUqcKsWbMADYfrdfjUObp/uox5mw/xUsdb+fcD4RTz1X8HO1UrV4IZQ5tRrkQx+k5Yyeo/NSDspP8N+WDPnj20aNGCJk2a0KRJE37//XcAevbsyXfffZfdr3///syaNYtz584xYMAAwsPDiYiIYPHixQBMmjSJTp060apVK+6+++7smwOlpaXxyiuvMHPmTBo3bszMmTNJTk5m4MCBxMTEEBERod+edrNp/0k6ffQbOxPPMK5PFINb1PK4vImyR9WyxZn5cCzlA4vRb+JK4vYkXX0klScK7YX3Dv3rX6Ruzd1LdvvfWp+b/vGPK/bJunwGQEhICLNnz6ZSpUosWLCAgIAAduzYQa9evYiLi6NHjx58+eWXdOzYkbS0NBYtWsSYMWMYPXo0IsLGjRvZtm0bbdu2Zfv27QCsWbOGDRs2EBQUlL3LqlixYvzzn/8kLi6Ojz76CIB//OMftGrViokTJ3LixAliYmJo3bo1JUsW7Yuezdt0kKdnrqdcCT9mDbuNBlVK212SukjlMsWZ+XAzeo1dTt+JK5k0IIaYkCC7yypyCm042MX9TnBZ0tPTeeyxx1i3bh0+Pj7Zb/QdOnTgySefJDU1lXnz5tGyZUuKFy/Ob7/9xuOPPw5A/fr1qVGjRvY4bdq0ISjo6v8oP/74I3Pnzs2+vei5c+fYu3dvkb0chjGGMb/s5D/z/qBx9bKM7RtJpVJ6D4aCKrh0ADOGxtJr3HL6f7aSif2jia1V3u6yipRCGw5X+4Sfn9577z2Cg4NZv349Tqcz+9LcAQEB3HnnncyfP5+ZM2fSs2fPq07rWj/5G2P4+uuvqVevXo5qLwxSMzJ54ZuNfLNmP/c1qsJ/uzYkwE/vwVDQVSodwPShsTw4boUrIPpFc5ueYpxv9JhDPjh58iSVK1fG4XDw+eefk5mZmT2sR48efPbZZyxZsoT27dsD0KJFC6ZNmwbA9u3b2bt371Xf5EuVKnXBrUbbtWvHhx9+SNaFFdeuXZvbi+UVjp1J5aHxK/hmzX6eal2XD3o21mDwIpVKuQKiRlBJBkxaxW87jtpdUpGh4ZAPhg8fzuTJk2nUqBHbtm274NN/27Zt+eWXX2jdujXFihXL7u90OgkPD6dHjx5MmjQJf3//K87jrrvuYsuWLdkHpF9++WXS09Np2LAhoaGhvPzyy3m6jAXRjsOn6fLxUjYknOSDXhE81foWPfDshSoE+vPFkKaEVCjJoMmr+HV7ot0lFQl6yW51Wd78+/xleyKPTVuDv58P4/pGEnFzObtLUjmUlJzGQ+NXEJ94hrF9IrmzXiW7S/I6esluVaRN/n0PAz5bSdVyxfn2seYaDIVEUMlifDGkKXUrBTJ0ymp+2nbY7pIKtRyFg4g8LSKbRWSTiEwXkQARCRGRFSISLyIzRaSY1dffeh1vDa/pNp0XrPY/RKRdzhZJFVUZmU5enrOJE
XM306p+JWY9chtVyxa3uyyVi8qWKMYXg2OpX7kUD3++mgVbNCDyyg2Hg4hUBZ4AoowxYYAP0BN4G3jPGFMHOA4MskYZBBy32t+z+iEiDazxQoH2wMciokcM1XU5eTadAZNW8fnyPxnashaf9oki0L/QnoxXpJUp4cfng5rSoEoZhk9bzfzNh+wuqVDK6W4lX6C4iPgCJYCDQCtgljV8MtDFet7Zeo01/G5xHR3sDMwwxqQaY3YD8UBMDutSRcifx5J54OOlLNt5jLf/Es4/7rlV78FQyJUp7sfng2IIq1qGR6et4YeNB+0uqdC54XAwxuwHRgJ7cYXCSWA1cMIYk2F1SwCy7sZeFdhnjZth9S/v3n6JcZS6opW7k+gyeinHktP4fFBTekTfbHdJKp+UDvBjysAYGlUvy2PT1/LdBg2I3JST3UrlcH3qDwGqACVx7RbKMyIyVETiRCQuMVFPZyvqvorbx4Pjl1OuRDFmD29Os9r6DdqiplSAH5MHxtDk5rI8MWMtc9cfsLukQiMnu5VaA7uNMYnGmHTgG6A5UNbazQRQDdhvPd8PVAewhpcBjrm3X2KcCxhjxhpjoowxURUrVsxB6Xnn0KFD9OzZk9q1axMZGck999yTfemLgubnn3/OvgigN3E6DW/9sI2/zdpATEgQs4c3J6RC0b5mVFEW6O/LpAExRNYox1Mz1jJn7SXfPtR1ykk47AViRaSEdezgbmALsBjoavXpB2RdDnSu9Rpr+E/G9SWLuUBP62ymEKAusDIHddnGGMP999/PnXfeyc6dO1m9ejX//ve/OXy4YJ5R4Y3hkJKWwbCpq/nkl530bnozkwbEUKaEn91lKZuV9Pdl0oBomoaU55kv1/HNmgS7S/J+xpgbfgCvAduATcDngD9QC9ebezzwFeBv9Q2wXsdbw2u5TedFYCfwB9DhWuYdGRlpLrZlyxaPtvy0aNEi06JFC492p9Npnn32WRMaGmrCwsLMjBkzjDHGLF682LRs2dJ06tTJhISEmOeee85MnTrVREdHm7CwMBMfH2+MMaZfv35m2LBhpmnTpiYkJMQsXrzYDBgwwNSvX9/069cvez7z5883sbGxJiIiwnTt2tWcPn3aGGNMjRo1zCuvvGIiIiJMWFiY2bp1q9m9e7cJDg42VapUMY0aNTK//vqrR912/z4vduBEirnn/V9NyPP/MxOW7DJOp9PuklQBk5KaYXqPW2ZqPv8/8+WqvXaXU+AAceYa399zdK6fMWYEMOKi5l1c4mwjY8w5oNtlpvMm8GZOarnYki+3c3TfmdycJBWqB9Ki+y2XHb5p0yYiIyM92r/55hvWrVvH+vXrOXr0KNHR0bRs2RKA9evXs3XrVoKCgqhVqxaDBw9m5cqVvP/++3z44YeMGjUKgOPHj7Ns2TLmzp1Lp06dWLp0KePHjyc6Opp169ZRrVo13njjDRYuXEjJkiV5++23effdd3nllVdctVeowJo1a/j4448ZOXIk48ePZ9iwYQQGBvLss8/m6u8pL6zfd4IhU+JISctkQr9o7qqv345VnooX82FCv2iGTInj719vwGmMnqRwg/RE8Hzw22+/0atXL3x8fAgODuaOO+5g1apVlC5dmujoaCpXrgxA7dq1adu2LQDh4eHZN/kBuO+++xARwsPDCQ4OJjw8HIDQ0FD27NlDQkICW7ZsoXnz5gCkpaXRrFmz7PEfeOABACIjI/nmm2/yZblzy3cbDvLMl+uoEOjP1480pd5NpewuSRVgAX4+jOsbxcOfr+a5rzeS6YTeTTUgrlehDYcrfcLPK6Ghodm37bxW7hfUczgc2a8dDgcZGRke/dz7uPfz8fGhTZs2TJ8+/Yrz8fHxuWC6BZkxho9+iuedBduJrFGOT/tEUiHwyhcgVApcAfFpn0gembqaf8zeSKYx9ImtYXdZXkWvrZSLWrVqRWpqKmPHjs1u27BhA2XLlmXmzJlkZmaSmJjIr7/+SkxM7n7PLzY2lqVLl
xIfHw9AcnLyVc+Suvgy3wXJufRMnp65jncWbKdL4ypMG9xUg0FdlwA/Hz7pE0nrWyvx8pxNTP59j90leRUNh1wkIsyePZuFCxdSu3ZtQkNDeeGFF+jduzcNGzakUaNGtGrViv/85z/cdNNNuTrvihUrMmnSJHr16kXDhg1p1qwZ27Zd+Tap9913H7Nnz6Zx48YsWbIkV+vJiaNnUnlw/ArmrDvAX9vcwns99B4M6sb4+/rw8YORtGkQzIi5m5n42267S/IaesludVl2/D7/OHSagZNWcSw5lXe6NaZjw8r5On9VOKVnOnn8i7XM23yIlzreyuAWtewuyRZ6yW7llRZvO8JfxvxOeqaTLx9upsGgco2fj4MPe0fQMbwyb3y3lU9/2Wl3SQVeoT0grbyHMYbPlu7hje+2UP+m0kzoH0XlMnqpbZW7/HwcvN+zMSLw7x+2kWkMw++sY3dZBVahCwdjjN4KMhfk1+7G9EwnI+Zu5osVe2nbIJj3ejSmpF5qW+URXx8Ho3o0xsch/GfeH2RmGh6/u67dZRVIheq/MCAggGPHjlG+fHkNiBwwxnDs2DECAgLydD4nU9IZ/sVqlsYfY9gdtfl7u3o49FLbKo/5+jh4t3tjfER4Z8F2Mo3hqdb5f+p7QVeowqFatWokJCSgV2zNuYCAAKpVq5Zn0999NJlBk1ax73gK/+3akG5R1a8+klK5xMch/LdbIxwOYdTCHTgNPN26rn6odFOowsHPz4+QkBC7y1BXsWznMYZNXY1DYOqgpjStpZfaVvnPxyH85y8NcQh8sGgHTqfhr21v0YCwFKpwUAXfzFV7eXH2JmpWKMmEflHUKK+X2lb2cTiEtx5oiI9D+GhxPBlOw3Pt62lAoOGg8kmm0/DWD1sZt2Q3LepWYPSDTSgdoJfaVvZzOIQ3u4Tj4xA++WUnTmN4oUP9Ih8QGg4qzyWnZvDkjLUs3HqEvs1q8Mq9DfD10a/YqILD4RBe7xyGQ4Sxv+4i02l4qeOtRTogNBxUntp/4iyDJ8fxx6FTvNYplH631bS7JKUuSUR4rVMoDhEm/LabTKdhxH0NimxAaDioPLN273GGTFlNanomnw2I4Y5bCuatXZXKIiKMuK8BPg5XQDiN4bVOoUUyIDQcVJ6Yu/4Az361nuDS/kwf0pS6wXoPBuUdRISXOt6Kr0P41NrF9HrnsCL3HRwNB5WrjDG8v2gHoxbuILpmOT55KJLyeqlt5WVEhOc71MfhEMb87DpI/WaX8CIVEBoOKtecS8/k77M2MHf9AR5oUpV/PxCOv69ealt5JxHh7+3q4SOu01wznYa3HmhYZAJCw0HliiOnzzF0ymrW7TvB39vX45E7ahfJ/bSqcBER/tr2FhwO4YNFO8h0wn+6ur4XUdhpOKgc23rwFIMmreJ4SjqfPBRJ+7DcvZGRUnYSEZ5pcws+Iry3cDvGGP7brVGhDwgNB5UjC7cc5okZaykV4MtXw5oRVrWM3SUplSeebF0XHweM/NF1sb53ujUq1N/X0XBQN8QYw/glu/nXD1sJq1KGcX2juKlM3l7FVSm7PdaqLo6sy307DaN6NC60AaHhoK5bWoaTl+dsYmbcPjqE3cS73RtTvJgeeFZFw/BsqJMgAAAUOUlEQVQ76+Ajwr9/2IbTGN7vGYFfIQwIDQd1XU6kpDFs6mqW70risbvq8EybW4rM2RtKZXn4jtr4OIQ3vtuK07mWD3pFUMy3cAWEhoO6ZrsSzzBochz7j5/l3e6NeKBJ3t3vQamCbnCLWjhE+Of/tvDMl+v4oGdEofqglKOoE5GyIjJLRLaJyFYRaSYiQSKyQER2WD/LWX1FRD4QkXgR2SAiTdym08/qv0NE+uV0oVTuWxp/lC6jl3LybDpfDGmqwaAUMPD2EJ7vUJ//bTjIm99vtbucXJXT7aD3gXnGmPpAI2Ar8DywyBhTF1hkvQboANS1HkOBMQAiEgSMAJoCMcCIrEBRBcO0FX/Sd+JKbioTw
LePNieqZpDdJSlVYDzcshb9b6vJhN92M+7XXXaXk2tueLeSiJQBWgL9AYwxaUCaiHQG7rS6TQZ+Bp4DOgNTjOvO9cutrY7KVt8Fxpgka7oLgPbA9ButTeWOTKfhze+2MnHpbu6sV5EPe0VQSu/BoNQFRIRX7m1A4ulU3vx+K5VK+9O5cVW7y8qxnBxzCAESgc9EpBGwGngSCDbGHLT6HAKCredVgX1u4ydYbZdrVzY6fS6dJ2es46dtR+h/W03XhcgK4RkZSuUGh0N4p3sjjp5J5dmv1lO+pD+3161gd1k5kpP/dl+gCTDGGBMBJHN+FxIA1laCycE8LiAiQ0UkTkTiEhMTc2uy6iL7klLoOmYZv2xP5PUuYbzaKVSDQamrCPDzYWzfKGpXDOThz+PYtP+k3SXlSE7+4xOABGPMCuv1LFxhcdjaXYT184g1fD9Q3W38albb5do9GGPGGmOijDFRFSvqvQHywuo/k+gyeikHTp5l0oBo+sTWsLskpbxGmeJ+TBoQQ5nifvT/bBX7klLsLumG3XA4GGMOAftEpJ7VdDewBZgLZJ1x1A/41no+F+hrnbUUC5y0dj/NB9qKSDnrQHRbq03lszlr99Nr7AoCA3yZPbw5LepqACt1vW4qE8CUQTGkZzrpO3ElSclpdpd0Q3K6r+BxYJqIbAAaA/8C3gLaiMgOoLX1GuB7YBcQD4wDhgNYB6JfB1ZZj39mHZxW+cPpNLz74x88NXMdjW8uy5zhzalTKdDuspTyWnUqlWJCvygOnDjLwEmrSEnLsLuk6yauwwLeJyoqysTFxV33eD0+XYafj4OgksUu+ShfshjlShajXIlihf6qiwBn0zJ59qv1fLfxIN2jqvFGl/BC901Ppewyf/MhHpm6mjvrVWJsn0jbj92JyGpjTNS19C1S35A2xlC6uB/HzqSScDyFY8lpnD536UQXgbLF/ShnBcaFIeJPUEk/gkr6Z4dJ+ZLFCPDzrusLHTl1jsFT4ti4/yQvdKjP0Ja19B4MSuWidqE38c/OYbw0ZxP/mL2Rt//S0Gv+x4pUOIgI4/peGJppGU5OpKRxLDmN48mun0mXeOw5msLqP09wPCWNTOelt7ZKFPPx3BopUYygQGuLpEQxygda4VKiGKWL+9r2h7Jp/0mGTInj5Nl0xvaJok2D4KuPpJS6bg/F1uDIqXN88FM8N5UO4Jm29a4+UgFQpMLhUor5OqhUOoBKpa/tctNOp+H0uQyOJadeEB5Z4ZLkFjA7Dp8hKTmNs+mZl5yWr0MolxUgJV0hkvW8fKAVJm7t5UoWy5WrP87ffIinZqyjbAk/vhrWjNAqeg8GpfLS021u4fCpVD74KZ7gMgE82LTgnwVY5MPhejkcQpkSfpQp4UetazyZ52xaJkkpaSSdSXP9TE7l2BlXgBxPSct+vvXgKZKS0ziRkn7ZaZUO8L3q7i33LZcSxXyyt06MMXz66y7enreNhtXKMq5P5DWHolLqxokIb94fRuKZVF6es4kKgf60Cy3Yd0wscgekvUFGppMTZ9NdWyFnrABJdoVL9vPkVJKS062faaRnXno9+vs6soPDxyFsSDhJx4aVeadbI687RqKUt0tJy6DXuBVsO3iKaYOb5vt1yq7ngLSGQyFgjOFMasb5XVrZWygXPo6npNH61mAeuaN2obq0sFLeJCk5ja5jfudYchqzhjWjbnCpfJu3hoNSShVg+5JSeGDM7/g5hG+GN8+3W+xeTzjoCe1KKZXPqgeV4LP+0Zw6l0H/z1Zy8uzljzPaRcNBKaVsEFa1DJ88FMnOxDM8/HkcqRmXPqvRLhoOSillk9vrVmBkt0Ys35XEMzPX47zMd6jsoKeyKqWUjTo3rsqRU64bBVUs5c+I+xoUiG9RazgopZTNhrSsxaFT55jw225uKhPAsDtq212ShoNSShUEL95zK0dOp/LWD9uoVMqfB5pUs7UeDQellCoAHA5hZLeGHD2dyt9nbaBCoD8tb7Hvnip6Q
FoppQoIf18fPu0bSZ1KgQybupqNCfbdalTDQSmlCpDSAX5MHhhDuRLFGDBpJX8eS7alDg0HpZQqYIJLu241muE09Ju4kqNnUvO9Bg0HpZQqgGpXDGRCv2gOnTrHoEmrSE7N31uNajgopVQBFVmjHB/2asLG/Sd59Is1pGc6823eGg5KKVWAtWkQzJv3h/PzH4k8//VG8utiqXoqq1JKFXC9Ym7m8KlzjFq4g5vK+PO3dvXzfJ665aCUUl7gybvr0iumOj9sOpQvxx90y0EppbyAiPB65zCSUzMp6Z/3b90aDkop5SV8fRyUKZE/O3x0t5JSSikPGg5KKaU85DgcRMRHRNaKyP+s1yEiskJE4kVkpogUs9r9rdfx1vCabtN4wWr/Q0Ta5bQmpZRSOZMbWw5PAlvdXr8NvGeMqQMcBwZZ7YOA41b7e1Y/RKQB0BMIBdoDH4uITy7UpZRS6gblKBxEpBrQERhvvRagFTDL6jIZ6GI972y9xhp+t9W/MzDDGJNqjNkNxAMxOalLKaVUzuR0y2EU8Hcg6zvd5YETxpisk3ATgKrW86rAPgBr+Emrf3b7Jca5gIgMFZE4EYlLTEzMYelKKaUu54bDQUTuBY4YY1bnYj1XZIwZa4yJMsZEVaxo300wlFKqsMvJ9xyaA51E5B4gACgNvA+UFRFfa+ugGrDf6r8fqA4kiIgvUAY45taexX0cpZRSNrjhLQdjzAvGmGrGmJq4Dij/ZIx5EFgMdLW69QO+tZ7PtV5jDf/JuK4gNRfoaZ3NFALUBVbeaF1KKaVyLi++If0cMENE3gDWAhOs9gnA5yISDyThChSMMZtF5EtgC5ABPGqMycyDupRSSl0jya/Lv+a2qKgoExcXZ3cZSinlNURktTEm6lr66jeklVJKedBwUEop5UHDQSmllAcNB6WUUh40HJRSSnnQcFBKKeVBw0EppZQHDQellFIeNByUUkp50HBQSinlQcNBKaWUBw0HpZRSHjQclFJKedBwUEop5UHDQSmllAcNB6WUUh40HJRSSnnQcFBKKeVBw0EppZQHDQellFIeNByUUkp50HBQSinlQcNBKaWUBw0HpZRSHjQclFJKebjhcBCR6iKyWES2iMhmEXnSag8SkQUissP6Wc5qFxH5QETiRWSDiDRxm1Y/q/8OEemX88VSSimVEznZcsgA/mqMaQDEAo+KSAPgeWCRMaYusMh6DdABqGs9hgJjwBUmwAigKRADjMgKFKWUUva44XAwxhw0xqyxnp8GtgJVgc7AZKvbZKCL9bwzMMW4LAfKikhloB2wwBiTZIw5DiwA2t9oXUoppXIuV445iEhNIAJYAQQbYw5agw4BwdbzqsA+t9ESrLbLtSullLJJjsNBRAKBr4GnjDGn3IcZYwxgcjoPt3kNFZE4EYlLTEzMrckqpZS6SI7CQUT8cAXDNGPMN1bzYWt3EdbPI1b7fqC62+jVrLbLtXswxow1xkQZY6IqVqyYk9KVUkpdQU7OVhJgArDVGPOu26C5QNYZR/2Ab93a+1pnLcUCJ63dT/OBtiJSzjoQ3dZqU0opZRPfHIzbHOgDbBSRdVbbP4C3gC9FZBDwJ9DdGvY9cA8QD6QAAwCMMUki8jqwyur3T2NMUg7qUkoplUPiOizgfaKiokxcXJzdZSillNcQkdXGmKhr6avfkFZKKeVBw0EppZQHDQellFIeNByUUkp50HBQSinlQcNBKaWUBw0HpZRSHjQclFJKedBwUEop5UHDQSmllAcNB6WUUh40HJRSSnnQcFBKKeVBw0EppZQHDQellFIeNByUUkp50HBQSinlQcNBKaWUBw0HpZRSHjQclFJKedBwUEop5UHDQSmllAcNB6WUUh40HJRSSnnQcFBKKeVBw0EppZSHAhMOItJeRP4QkXgRed7uepRSqigrEOEgIj7AaKAD0ADoJSIN7K1KKaWKLl+7C7DEAPHGmF0AIjID6Axsye0ZOX//CDLTwOkE4wQDGCfGa
f00TjAG43RijAHr4XruxDitn8aA07j1t8aH7OFZfc+Pn9XPbTzInpZrvKzhXDhedn+5qN2AVeaF7WRP371fNgERsZ64GkQExGS3uQ8X3J4LgMMaz31agIg1OUf2dC+clmQ3n3+eNVGxpi3gcHstYs3PuM3HgYi1erKeZy2Lw4HJGg3BiGTXatzmZbI+GomDrA5GztduxG24VasBjMORvVwGt/lZpZusZXWIq7/778/hwFjjuWp2YMSACCZ7HVjzEddrg1i1uJZFHNbz7Brchrs9z1o5xvodZ63+838GgsFk/aKw/lSsaRprfm59jblgOhdM47J9zQV9sueTtd4vmKf79MBp3MaRy0z3otfuC5i93Mathov6XrAs5vy8L1V31vKZrJHOLw6C4CsOfBw++CD4OAQf8cFXXO0OEXxw4IvgIw4cWf2tdtdP8ME1jgOx+rqeW28crpn7+MHNseS1ghIOVYF9bq8TgKZ5MaOPJ9dGxD8vJq1U/jNO9xeXeX4pcpXhuTyeFIidFPbKXlfuMe0Ek3lB2/nYyko59/4Gw2keHVd0wuGaiMhQYCjAzTfffEPTqOi3Ad8T57hwZVx6JVz42cJqu6CP9dOc/8xzcX/3Pp6f3cz5zybmUv3P9zXZdV5Yq2T/wbjP/6KaDBf1cf8Hd/tU7fqMYjW7f8o7/ylU3MdzH561FXBxu9v0soa7f3J1n7/nG4941nrRMOM+uvu05fy4WXO++BPv5X4P5z9Vei6vcXt+pdovP+7llsX9+cWfsi/8PV56PKuvIXvr6uI1fcnOl3F9EXDhGr1av+uZtlyhxvPTzPqduY13DXVccn5Xmq65/HSNnF9Pl343cH/LF882t3WGOb8V5jktwRl4o8F+fQpKOOwHqru9rma1XcAYMxYYCxAVFXW1v5pL6vHRczcymlJKFSkFZVtvFVBXREJEpBjQE5hrc01KKVVkFYgtB2NMhog8BswHfICJxpjNNpellFJFVoEIBwBjzPfA93bXoZRSquDsVlJKKVWAaDgopZTyoOGglFLKg4aDUkopDxoOSimlPIi56jcQCyYRSQT+vMzgCsDRfCwnvxTW5QJdNm+ly+ZdahhjKl5LR68NhysRkThjTJTddeS2wrpcoMvmrXTZCi/draSUUsqDhoNSSikPhTUcxtpdQB4prMsFumzeSpetkCqUxxyUUkrlTGHdclBKKZUDXh0OIlJdRBaLyBYR2SwiT1rtQSKyQER2WD/L2V3rjRIRHxFZKyL/s16HiMgKEYkXkZnWJc69joiUFZFZIrJNRLaKSLPCsN5E5Gnrb3GTiEwXkQBvXmciMlFEjojIJre2S64ncfnAWs4NItLEvsqv7DLL9V/r73GDiMwWkbJuw16wlusPEWlnT9X5y6vDAcgA/mqMaQDEAo+KSAPgeWCRMaYusMh67a2eBLa6vX4beM8YUwc4Dgyypaqcex+YZ4ypDzTCtYxevd5EpCrwBBBljAnDdfn5nnj3OpsEtL+o7XLrqQNQ13oMBcbkU403YhKey7UACDPGNAS2Ay8AWO8pPYFQa5yPRcQn/0q1h1eHgzHmoDFmjfX8NK43mKpAZ2Cy1W0y0MWeCnNGRKoBHYHx1msBWgGzrC5euWwiUgZoCUwAMMakGWNOUDjWmy9QXER8gRLAQbx4nRljfgWSLmq+3HrqDEwxLsuBsiJSOX8qvT6XWi5jzI/GmAzr5XJcd6QE13LNMMakGmN2A/FATL4VaxOvDgd3IlITiABWAMHGmIPWoENAsE1l5dQo4O9A1p3JywMn3P6AE3CFobcJARKBz6xdZuNFpCRevt6MMfuBkcBeXKFwElhN4Vhn7i63nqoC+9z6efOyDgR+sJ4XpuW6ZoUiHEQkEPgaeMoYc8p9mHGdjuV1p2SJyL3AEWPMartryQO+QBNgjDEmAkjmol1I3rjerH3vnXGFXxWgJJ67LgoVb1xPVyMiL+LaZT3N7lrs5PXhICJ+uIJhmjHmG6v5cNbmr
PXziF315UBzoJOI7AFm4No18T6uTfWsO/hVA/bbU16OJAAJxpgV1utZuMLC29dba2C3MSbRGJMOfINrPRaGdebucutpP1DdrZ/XLauI9AfuBR4058/z9/rluhFeHQ7WPvgJwFZjzLtug+YC/azn/YBv87u2nDLGvGCMqWaMqYnrYNhPxpgHgcVAV6ubty7bIWCfiNSzmu4GtuD9620vECsiJay/zazl8vp1dpHLrae5QF/rrKVY4KTb7qcCT0Ta49qN28kYk+I2aC7QU0T8RSQE1wH3lXbUmK+MMV77AG7HtUm7AVhnPe7BtW9+EbADWAgE2V1rDpfzTuB/1vNauP4w44GvAH+767vBZWoMxFnrbg5QrjCsN+A1YBuwCfgc8PfmdQZMx3X8JB3XFt+gy60nQIDRwE5gI66ztmxfhutYrnhcxxay3ks+cev/orVcfwAd7K4/Px76DWmllFIevHq3klJKqbyh4aCUUsqDhoNSSikPGg5KKaU8aDgopZTyoOGglFLKg4aDUkopDxoOSimlPPw//dxRNyxb2QMAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "df.sort_values(by=['Views'], ascending=False).head(5).sort_index().plot()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "df.sort_values(by=['Views'], ascending=False)[['Views','title']].head(3).sort_index().plot()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "df['title_short'] = df['title'].str[:20]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
| | title | Views | Like | Dislike | Favorite | Comment | videoID | tags | nameurl | title_short |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PyCharm/IntelliJ fast and auto change of the color theme | 41.0 | 0.0 | 0.0 | 0.0 | 2.0 | https://www.youtube.com/embed/SsX9Fl958W0 | https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg | <a href=\"https://www.youtube.com/embed/SsX9Fl958W0\">XXXXX</a> | PyCharm/IntelliJ fas |
| 1 | How to add weather desklet to Linux Mint 19 | 291.0 | 0.0 | 0.0 | 0.0 | 0.0 | https://www.youtube.com/embed/-FPY_e0BdJs | https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg | <a href=\"https://www.youtube.com/embed/-FPY_e0BdJs\">XXXXX</a> | How to add weather d |
| 2 | How to easy integrate Google Calendar to Desktop for Linux Mint | 226.0 | 1.0 | 0.0 | 0.0 | 0.0 | https://www.youtube.com/embed/2evIujisdD0 | https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg | <a href=\"https://www.youtube.com/embed/2evIujisdD0\">XXXXX</a> | How to easy integrat |
| 3 | Pandas use a list of values to select rows from a column | 45.0 | 3.0 | 0.0 | 0.0 | 10.0 | https://www.youtube.com/embed/jlSbo5wmTPQ | https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg | <a href=\"https://www.youtube.com/embed/jlSbo5wmTPQ\">XXXXX</a> | Pandas use a list of |
| 4 | Pandas count and percentage by value for a column | 63.0 | 3.0 | 0.0 | 0.0 | 0.0 | https://www.youtube.com/embed/P5pxJkv71BU | https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg | <a href=\"https://www.youtube.com/embed/P5pxJkv71BU\">XXXXX</a> | Pandas count and per |
" + ], + "text/plain": [ + " title Views \\\n", + "0 PyCharm/IntelliJ fast and auto change of the color theme 41.0 \n", + "1 How to add weather desklet to Linux Mint 19 291.0 \n", + "2 How to easy integrate Google Calendar to Desktop for Linux Mint 226.0 \n", + "3 Pandas use a list of values to select rows from a column 45.0 \n", + "4 Pandas count and percentage by value for a column 63.0 \n", + "\n", + " Like Dislike Favorite Comment \\\n", + "0 0.0 0.0 0.0 2.0 \n", + "1 0.0 0.0 0.0 0.0 \n", + "2 1.0 0.0 0.0 0.0 \n", + "3 3.0 0.0 0.0 10.0 \n", + "4 3.0 0.0 0.0 0.0 \n", + "\n", + " videoID \\\n", + "0 https://www.youtube.com/embed/SsX9Fl958W0 \n", + "1 https://www.youtube.com/embed/-FPY_e0BdJs \n", + "2 https://www.youtube.com/embed/2evIujisdD0 \n", + "3 https://www.youtube.com/embed/jlSbo5wmTPQ \n", + "4 https://www.youtube.com/embed/P5pxJkv71BU \n", + "\n", + " tags \\\n", + "0 https://i.ytimg.com/vi/SsX9Fl958W0/hqdefault.jpg \n", + "1 https://i.ytimg.com/vi/-FPY_e0BdJs/hqdefault.jpg \n", + "2 https://i.ytimg.com/vi/2evIujisdD0/hqdefault.jpg \n", + "3 https://i.ytimg.com/vi/jlSbo5wmTPQ/hqdefault.jpg \n", + "4 https://i.ytimg.com/vi/P5pxJkv71BU/hqdefault.jpg \n", + "\n", + " nameurl \\\n", + "0 XXXXX \n", + "1 XXXXX \n", + "2 XXXXX \n", + "3 XXXXX \n", + "4 XXXXX \n", + "\n", + " title_short \n", + "0 PyCharm/IntelliJ fas \n", + "1 How to add weather d \n", + "2 How to easy integrat \n", + "3 Pandas use a list of \n", + "4 Pandas count and per " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "df.set_index('title_short', inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + 
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAEBCAYAAACT92m7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAE2hJREFUeJzt3X+0XWV95/H3BxOMNgEE77CEgKGFoUZQRo6IMqHiD4J11RAblHQcMaBMR5xOdZWBLl3YH/9Ix1HGH6WyhCFMuwiUmZF0IgSG5ZpkudByk1KTkCJRqV5AExPREcvv7/xxd5zDfRJyk5xw5ibv11pnnb2/+9nPfZ6slXzu3s8+OakqJEnqd9CwByBJ+v+P4SBJahgOkqSG4SBJahgOkqSG4SBJahgOkqSG4SBJahgOkqTGtGEPYE+9/OUvrzlz5gx7GJI0paxZs+bHVTWyq3ZTNhzmzJnD6OjosIchSVNKkn+cTDtvK0mSGoaDJKlhOEiSGlN2zUGSduapp55ibGyMxx9/fNhDGZoZM2Ywe/Zspk+fvkfnGw6S9jtjY2PMmjWLOXPmkGTYw3nBVRVbt25lbGyM4447bo/68LaSpP3O448/zhFHHHFABgNAEo444oi9unIyHCTtlw7UYNhub+dvOEiSGoaDJA3YWWedxcqVK59Tu+qqq1iyZAmLFi0a0qh2j+EgSQO2ePFili1b9pzasmXLWLJkCbfccsuQRrV7DAdJGrBFixaxYsUKnnzySQAefPBBHn74YY455hhOOukkAJ555hkuvfRSXv/61/Oa17yGL33pSwBccsklLF++HICFCxdy4YUXAnDdddfx8Y9/nMcee4x3vvOdvPa1r+Wkk07ipptu2idz8FFWSfu1P/6bDdz38M8G2ufcow7hk7/16p0eP/zwwznttNO47bbbWLBgAcuWLeM973nPcxaJr732Wg499FDuuecennjiCc444wzOPvts5s2bx+rVq3nXu97FQw89xCOPPALA6tWrOf/887n99ts56qijWLFiBQA//elPBzq37bxykKR9oP/W0rJly1i8ePFzjt9xxx3ccMMNnHLKKbzhDW9g69atPPDAA78Mh/vuu4+5c+dy5JFH8sgjj3D33Xfzpje9iZNPPpk777yTyy67jNWrV3PooYfuk/F75SBpv/Z8v+HvSwsWLOCjH/0oa9eu5Re/+AWnnnoqDz744C+PVxWf//znmT9/fnPuo48+yu23386ZZ57Jtm3buPnmm5k5cyazZs1i1qxZrF27lq9+9at84hOf4K1vfStXXHHFwMfvlYMk7QMzZ87krLPO4sILL2yuGgDmz5/P1VdfzVNPPQXAt7/9bR577DEATj/9dK666irOPPNM5s2bx6c//WnmzZsHwMMPP8xLX/pS3ve+93HppZeydu3afTJ+rxwkaR9ZvHgxCxcubJ5cAvjgBz/Igw8+yOte9zqqipGREb7yla8AMG/ePO644w6OP/54XvnKV7Jt27ZfhsO6deu49NJLOeigg5g+fTpXX331Phl7qmqfdLyv9Xq98st+JO3Ixo0bedWrXjXsYQzdjv4ckqypqt6uzvW2kiSpYThIkhqGg6T90lS9ZT4oezt/w0HSfmfGjBls3br1gA2I7d/nMGPGjD3uw6eVJO13Zs+ezdjYGFu2bBn2UIZm+zfB7SnDQdJ+Z/r06Xv8DWga520lSVLDcJAkNQwHSVLDcJAkNQwHSVJjl+GQ5Lokm5Os76udl2RDkmeT9Prq05MsTbIuycYkf9h37Jwk9yfZlOTyvvpxSb7Z1W9KcvAgJyhJ2n2TuXK4HjhnQm098G5g1YT6ecCLq+pk4FTg3ySZk+RFwBeBdwBzgcVJ5nbnXAl8tqqOB34CXLQnE5EkDc4uw6GqVgHbJtQ2VtX9O2oO/EqSacBLgCeBnwGnAZuq6rtV9SSwDFiQ8e/Mewuw/Ru3lwLn7ulkJEmDMe
g1h1uAx4BHgO8Dn66qbcDRwA/62o11tSOAR6vq6Ql1SdIQDfoT0qcBzwBHAS8DVif5X4PqPMnFwMUAxx577KC6lSRNMOgrh98Bbq+qp6pqM/B1oAc8BBzT1252V9sKHNbdhuqv71BVXVNVvarqjYyMDHjokqTtBh0O32d8DYEkvwKcDvwDcA9wQvdk0sHA+cDyGv8vE78GLOrOvwC4dcBjkiTtpsk8ynojcDdwYpKxJBclWZhkDHgjsCLJyq75F4GZSTYwHgj/paq+1a0pfARYCWwEbq6qDd05lwEfS7KJ8TWIawc5QUnS7vM7pCXpAOJ3SEuS9pjhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElq7DIcklyXZHOS9X2185JsSPJskt6E9q9Jcnd3fF2SGV391G5/U5LPJUlXPzzJnUke6N5fNuhJSpJ2z2SuHK4HzplQWw+8G1jVX0wyDfhL4Her6tXAm4GnusNXAx8CTuhe2/u8HLirqk4A7ur2JUlDtMtwqKpVwLYJtY1Vdf8Omp8NfKuq/r5rt7WqnknyCuCQqvpGVRVwA3Bud84CYGm3vbSvLkkakkGvOfxzoJKsTLI2yX/o6kcDY33txroawJFV9Ui3/UPgyAGPSZK0m6btg/7+JfB64BfAXUnWAD+dzMlVVUlqZ8eTXAxcDHDsscfu/WglSTs06CuHMWBVVf24qn4BfBV4HfAQMLuv3eyuBvCj7rYT3fvmnXVeVddUVa+qeiMjIwMeuiRpu0GHw0rg5CQv7RanfwO4r7tt9LMkp3dPKb0fuLU7ZzlwQbd9QV9dkjQkk3mU9UbgbuDEJGNJLkqyMMkY8EZgRZKVAFX1E+AzwD3AvcDaqlrRdfVh4MvAJuA7wG1d/VPA25M8ALyt25ckDVHGHx6aenq9Xo2Ojg57GJI0pSRZU1W9XbXzE9KSpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElq7DIcklyXZHOS9X2185JsSPJskt4Ozjk2yc+T/EFf7Zwk9yfZlOTyvvpxSb7Z1W9KcvAgJiZJ2nOTuXK4HjhnQm098G5g1U7O+Qxw2/adJC8Cvgi8A5gLLE4ytzt8JfDZqjoe+Alw0WQHL0naN3YZDlW1Ctg2obaxqu7fUfsk5wLfAzb0lU8DNlXVd6vqSWAZsCBJgLcAt3TtlgLn7vYsJEkDNdA1hyQzgcuAP55w6GjgB337Y13tCODRqnp6Qn1n/V+cZDTJ6JYtWwY3cEnScwx6QfqPGL9F9PMB9wtAVV1TVb2q6o2MjOyLHyFJAqYNuL83AIuS/BlwGPBskseBNcAxfe1mAw8BW4HDkkzrrh621yVJQzTQcKiqedu3k/wR8POq+kKSacAJSY5j/B//84HfqapK8jVgEePrEBcAtw5yTJKk3TeZR1lvBO4GTkwyluSiJAuTjAFvBFYkWfl8fXRXBR8BVgIbgZuravuC9WXAx5JsYnwN4to9n44kaRBSVcMewx7p9Xo1Ojo67GFI0pSSZE1VNZ9Pm8hPSEuSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGrsMhyTXJdmcZH1f7bwkG5I8m6TXV397kjVJ1nXvb+k7dmpX35Tkc0nS1Q9PcmeSB7r3lw16kpKk3TOZK4frgXMm1NYD7wZWTa
j/GPitqjoZuAD4r33HrgY+BJzQvbb3eTlwV1WdANzV7UuShmiX4VBVq4BtE2obq+r+HbT9u6p6uNvdALwkyYuTvAI4pKq+UVUF3ACc27VbACzttpf21SVJQ7Iv1xx+G1hbVU8ARwNjfcfGuhrAkVX1SLf9Q+DIfTgmSdIkTNsXnSZ5NXAlcPbunFdVlaSep9+LgYsBjj322L0aoyRp5wZ+5ZBkNvA/gPdX1Xe68kPA7L5ms7sawI+6205075t31ndVXVNVvarqjYyMDHrokqTOQMMhyWHACuDyqvr69np32+hnSU7vnlJ6P3Brd3g544vXdO+3Ikkaqsk8ynojcDdwYpKxJBclWZhkDHgjsCLJyq75R4DjgSuS3Nu9/ll37MPAl4FNwHeA27r6p4C3J3kAeFu3L0kaoow/PDT19Hq9Gh0dHfYwJGlKSbKmqnq7aucnpCVJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktTYZTgkuS7J5iTr+2rnJdmQ5NkkvQnt/zDJpiT3J5nfVz+nq21Kcnlf/bgk3+zqNyU5eFCTkyTtmclcOVwPnDOhth54N7Cqv5hkLnA+8OrunD9P8qIkLwK+CLwDmAss7toCXAl8tqqOB34CXLRnU5EkDcouw6GqVgHbJtQ2VtX9O2i+AFhWVU9U1feATcBp3WtTVX23qp4ElgELkgR4C3BLd/5S4Nw9no0kaSAGveZwNPCDvv2xrraz+hHAo1X19IS6JGmIptSCdJKLk4wmGd2yZcuwhyNJ+61Bh8NDwDF9+7O72s7qW4HDkkybUN+hqrqmqnpV1RsZGRnowCVJ/8+gw2E5cH6SFyc5DjgB+FvgHuCE7smkgxlftF5eVQV8DVjUnX8BcOuAxyRJ2k2TeZT1RuBu4MQkY0kuSrIwyRjwRmBFkpUAVbUBuBm4D7gduKSqnunWFD4CrAQ2Ajd3bQEuAz6WZBPjaxDXDnaKkqTdlfFf3qeeXq9Xo6Ojwx6GJE0pSdZUVW9X7abUgrQk6YVhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGpMKhyTXJdmcZH1f7fAkdyZ5oHt/WVc/NMnfJPn7JBuSLOk754Ku/QNJLuirn5pkXZJNST6XJIOcpCRp90z2yuF64JwJtcuBu6rqBOCubh/gEuC+qnot8GbgPyU5OMnhwCeBNwCnAZ/cHijA1cCHgBO618SfJUl6AU0qHKpqFbBtQnkBsLTbXgqcu705MKv77X9md97TwHzgzqraVlU/Ae4EzknyCuCQqvpGVRVwQ19fkqQhmLYX5x5ZVY902z8Ejuy2vwAsBx4GZgHvrapnkxwN/KDv/DHg6O41toO6JGlIBrIg3f3GX93ufOBe4CjgFOALSQ4ZxM9JcnGS0SSjW7ZsGUSXkqQd2Jtw+FF3S4jufXNXXwL89xq3Cfge8OvAQ8AxfefP7moPddsT642quqaqelXVGxkZ2YuhS5Kez96Ew3Jg+xNHFwC3dtvfB94KkORI4ETgu8BK4OwkL+sWos8GVna3pn6W5PRuneL9fX1JkoZgUmsOSW5k/MmjlycZY/ypo08BNye5CPhH4D1d8z8Frk+yDghwWVX9uOvnT4F7unZ/UlXbF7k/zPgTUS8BbutekqQhyfhywdTT6/VqdHR02MOQpCklyZqq6u2qnZ+QliQ1DAdJUsNwkCQ1DAdJUsNwkCQ1puzTSkm2MP4I7VTycuDHwx7EC8w5Hxic89Txyqra5aeIp2w4TEVJRifzCNn+xDkfGJ
zz/sfbSpKkhuEgSWoYDi+sa4Y9gCFwzgcG57yfcc1BktTwykGS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0EHlCSHJflwt31Uklu67VOS/GZfuw8k+cKAfuabk/zPvezjA0mOGsR4pMkwHHSgOYzx7yynqh6uqkVd/RTgN3d61hAleRHwAcBw0AvGcNCB5lPAryW5N8lfJ1mf5GDgT4D3dvX39p+QZCTJf0tyT/c6Y2edJ/mNro97k/xdklndoZlJbknyD0n+Kkm69m/t2q1Lcl2SF3f1B5NcmWQtsBjoAX/V9fuSffDnIj2H4aADzeXAd6rqFOBSgKp6ErgCuKmqTqmqmyac85+Bz1bV64HfBr78PP3/AXBJ1/884J+6+r8Afh+YC/wqcEaSGcD1wHur6mRgGvBv+/raWlWvq6q/BEaBf9WN75+Q9jHDQdq1twFfSHIvsBw4JMnMnbT9OvCZJL8HHFZVT3f1v62qsap6FrgXmAOcCHyvqr7dtVkKnNnX18SQkl4w04Y9AGkKOAg4vaoe31XDqvpUkhWMr198Pcn87tATfc2eYXJ/9x7b7ZFKA+KVgw40/weYtRt1gDuAf7d9J8kpO+s8ya9V1bqquhK4B/j15xnL/cCcJMd3+/8a+N+7OW5pnzAcdECpqq2M/0a/HviPfYe+Bszd0YI08HtAL8m3ktwH/O7z/Ijf7xa5vwU8Bdz2PGN5HFgC/HWSdcCzwF/spPn1wF+4IK0Xiv9ltySp4ZWDJKnhgrS0B5IsAf79hPLXq+qSYYxHGjRvK0mSGt5WkiQ1DAdJUsNwkCQ1DAdJUsNwkCQ1/i875WAmq/t/RQAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "df.sort_values(by=['Views'], ascending=False)[['Views']].head(1).plot()" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "df.reset_index(inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "df.sort_values(by=['Views'], ascending=False)[['Views']].head(5).T.plot.bar()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAEFCAYAAAAIZiutAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAFQVJREFUeJzt3X+Q3XV97/HnGwikkJQfYZtKkpL0kmlFVMQ1pDrJgOklQbwNOMiQqZIBbDoVrbV3cg0Xx1QsM8hYQanmihIFxxqY+INM+ZnyQ9Op0oQgAomajESyIUqaBbRQCsH3/eN8Qg75bMiyZ7PfTfb5mNnZ7/fz+Xy/+97DHl75fr6fc05kJpIktTuo6QIkScOP4SBJqhgOkqSK4SBJqhgOkqSK4SBJqhgOkqSK4SBJqhgOkqTKIU0XMFDHHntsTp48uekyJGm/8cADD/xHZnb1Z+x+Gw6TJ09mzZo1TZchSfuNiPhFf8c6rSRJqhgOkqSK4SBJquy39xwkaU9efPFFenp6eP7555supRGjR49m4sSJjBo1asDnMBwkHXB6enoYO3YskydPJiKaLmdIZSbbt2+np6eHKVOmDPg8TitJOuA8//zzjBs3bsQFA0BEMG7cuI6vmgwHSQekkRgMOw3G7244SJIq3nOQdMCbvOjWQT3fpivPetX+008/nUWLFjF79uyX26655hoeeughfvOb37B8+fJBrWdfGLHhMNh/LAO1tz8ySfufefPmsWzZsleEw7Jly7jqqquYOXNmg5X1n9NKkjTIzj33XG699VZeeOEFADZt2sQTTzzBpEmTOOmkkwB46aWXWLhwIW9729t405vexJe+9CUALrnkElasWAHAOeecw0UXXQTA0qVLueyyy3j22Wc566yzePOb38xJJ53ETTfdtE9+B8NBkgbZMcccw7Rp07j99tuB1lXDeeed94obxddffz1HHnkkq1evZvXq1Xz5y1/mscceY8aMGaxatQqALVu2sG7dOgBWrVrFzJkzueOOOzjuuON46KGHeOSRR5gzZ84++R0MB0naB3ZOLUErHObNm/eK/rvuuosbb7yRk08+mVNPPZXt27ezYcOGl8Nh3bp1nHjiiYwfP56tW7fygx/8gLe//e288Y1vZOXKlXzsYx9j1apVHHnkkfuk/hF7z0GS9qW5c+fy0Y9+lLVr1/Lcc8/x1re+lU2bNr3cn5lce+21r7gvsdPTTz/NHXfcwcyZM+nt7eXmm29mzJgxjB07lrFjx7J27Vpuu+02Pv7xjzNr1iw+8YlPDHr9XjlI0j4wZswYTj/9dC666KLqqgFg9uzZLFmyhBdffBGAn/3sZzz77LMATJ8+nWuuuYaZM2cyY8YMPvOZzzBjxgwAnnjiCQ4//HDe9773sXDhQtauXbtP6vfKQdIBr6lVgfPmzeOcc855eXqp3Qc+8AE2bdrEKaecQmbS1dXFd7/7XQBmzJjBXXfdxQknnMDxxx9Pb2/vy+Hw8MMPs3DhQg466CBGjRrFkiVL9kntkZn75MT7Wnd3d3byYT8uZZUOXOvXr+f1r39902U0qq/HICIeyMzu
/hzvtJIkqWI4SJIqhoOkA9L+OmU+GAbjdzccJB1wRo8ezfbt20dkQOz8PIfRo0d3dB5XK0k64EycOJGenh62bdvWdCmN2PlJcJ0wHCQdcEaNGtXRp6DJaSVJUh8MB0lSxXCQJFUMB0lSxXCQJFUMB0lSxXCQJFX2Gg4RsTQinoyIR9rajomIlRGxoXw/urRHRHw+IjZGxI8j4pS2Y+aX8RsiYn5b+1sj4uFyzOej/XP0JEmN6M+Vw9eA3T+kdBFwd2ZOBe4u+wBnAlPL1wJgCbTCBFgMnApMAxbvDJQy5i/ajts3H4gqSeq3vYZDZn4f6N2teS5wQ9m+ATi7rf3GbPkhcFREvA6YDazMzN7MfApYCcwpfb+bmT/M1pug3Nh2LklSQwZ6z2F8Zm4t278ExpftCcDmtnE9pe3V2nv6aJckNajjG9LlX/xD8taHEbEgItZExJqR+oZakjQUBhoOvypTQpTvT5b2LcCktnETS9urtU/so71PmXldZnZnZndXV9cAS5ck7c1Aw2EFsHPF0Xzglrb2C8qqpenAM2X66U7gjIg4utyIPgO4s/T9OiKml1VKF7SdS5LUkL2+ZXdEfBM4DTg2InporTq6Erg5Ii4GfgGcV4bfBrwL2Ag8B1wIkJm9EfEpYHUZd3lm7rzJ/UFaK6J+B7i9fEmSGrTXcMjMeXvomtXH2AQu2cN5lgJL+2hfA5y0tzokSUPHV0hLkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiqGgySpYjhIkiodhUNEfDQiHo2IRyLimxExOiKmRMT9EbExIm6KiEPL2MPK/sbSP7ntPJeW9p9GxOzOfiVJUqcGHA4RMQH4a6A7M08CDgbOBz4NXJ2ZJwBPAReXQy4GnirtV5dxRMSJ5bg3AHOAL0bEwQOtS5LUuU6nlQ4BficiDgEOB7YC7wSWl/4bgLPL9tyyT+mfFRFR2pdl5n9n5mPARmBah3VJkjow4HDIzC3AZ4DHaYXCM8ADwNOZuaMM6wEmlO0JwOZy7I4yflx7ex/HSJIa0Mm00tG0/tU/BTgOOILWtNA+ExELImJNRKzZtm3bvvxRkjSidTKt9KfAY5m5LTNfBL4NvAM4qkwzAUwEtpTtLcAkgNJ/JLC9vb2PY14hM6/LzO7M7O7q6uqgdEnSq+kkHB4HpkfE4eXewSxgHXAvcG4ZMx+4pWyvKPuU/nsyM0v7+WU10xRgKvDvHdQlSerQIXsf0rfMvD8ilgNrgR3Ag8B1wK3Asoj4+9J2fTnkeuDrEbER6KW1QonMfDQibqYVLDuASzLzpYHWJUnq3IDDASAzFwOLd2v+OX2sNsrM54H37uE8VwBXdFKLJGnw+AppSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLFcJAkVQwHSVLlkKYLUPMmL7q16RIA2HTlWU2XIKno6MohIo6KiOUR8ZOIWB8RfxIRx0TEyojYUL4fXcZGRHw+IjZGxI8j4pS288wv4zdExPxOfylJUmc6nVb6HHBHZv4x8GZgPbAIuDszpwJ3l32AM4Gp5WsBsAQgIo4BFgOnAtOAxTsDRZLUjAGHQ0QcCcwErgfIzBcy82lgLnBDGXYDcHbZngvcmC0/BI6KiNcBs4GVmdmbmU8BK4E5A61LktS5Tq4cpgDbgK9GxIMR8ZWIOAIYn5lby5hfAuPL9gRgc9vxPaVtT+2ViFgQEWsiYs22bds6KF2S9Go6CYdDgFOAJZn5FuBZdk0hAZCZCWQHP+MVMvO6zOzOzO6urq7B
Oq0kaTedhEMP0JOZ95f95bTC4ldluojy/cnSvwWY1Hb8xNK2p3ZJUkMGHA6Z+Utgc0T8UWmaBawDVgA7VxzNB24p2yuAC8qqpenAM2X66U7gjIg4utyIPqO0SZIa0unrHD4MfCMiDgV+DlxIK3BujoiLgV8A55WxtwHvAjYCz5WxZGZvRHwKWF3GXZ6ZvR3WJUnqQEfhkJk/Arr76JrVx9gELtnDeZYCSzupRZI0eHyFtNTGV4tLLb63kiSpYjhIkiqGgySpYjhIkirekJbUJ2/Oj2xeOUiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKliOEiSKoaDJKnih/1I0l6MxA8+8spBklQxHCRJFcNBklQxHCRJlY7DISIOjogHI+Kfy/6UiLg/IjZGxE0RcWhpP6zsbyz9k9vOcWlp/2lEzO60JklSZwbjyuEjwPq2/U8DV2fmCcBTwMWl/WLgqdJ+dRlHRJwInA+8AZgDfDEiDh6EuiRJA9RROETEROAs4CtlP4B3AsvLkBuAs8v23LJP6Z9Vxs8FlmXmf2fmY8BGYFondUmSOtPplcM1wP8Bflv2xwFPZ+aOst8DTCjbE4DNAKX/mTL+5fY+jpEkNWDA4RAR7waezMwHBrGevf3MBRGxJiLWbNu2bah+rCSNOJ1cObwD+LOI2AQsozWd9DngqIjY+crricCWsr0FmARQ+o8Etre393HMK2TmdZnZnZndXV1dHZQuSXo1Aw6HzLw0Mydm5mRaN5Tvycw/B+4Fzi3D5gO3lO0VZZ/Sf09mZmk/v6xmmgJMBf59oHVJkjq3L95b6WPAsoj4e+BB4PrSfj3w9YjYCPTSChQy89GIuBlYB+wALsnMl/ZBXZKkfhqUcMjM+4D7yvbP6WO1UWY+D7x3D8dfAVwxGLVIkjrnK6QlSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUMRwkSRXDQZJUGXA4RMSkiLg3ItZFxKMR8ZHSfkxErIyIDeX70aU9IuLzEbExIn4cEae0nWt+Gb8hIuZ3/mtJkjrRyZXDDuB/Z+aJwHTgkog4EVgE3J2ZU4G7yz7AmcDU8rUAWAKtMAEWA6cC04DFOwNFktSMAYdDZm7NzLVl+zfAemACMBe4oQy7ATi7bM8FbsyWHwJHRcTrgNnAyszszcyngJXAnIHWJUnq3KDcc4iIycBbgPuB8Zm5tXT9EhhfticAm9sO6ylte2rv6+csiIg1EbFm27Ztg1G6JKkPHYdDRIwBvgX8TWb+ur0vMxPITn9G2/muy8zuzOzu6uoarNNKknbTUThExChawfCNzPx2af5VmS6ifH+ytG8BJrUdPrG07aldktSQTlYrBXA9sD4zP9vWtQLYueJoPnBLW/sFZdXSdOCZMv10J3BGRBxdbkSfUdokSQ05pINj3wG8H3g4In5U2v4vcCVwc0RcDPwCOK/03Qa8C9gIPAdcCJCZvRHxKWB1GXd5ZvZ2UJckqUMDDofM/Fcg9tA9q4/xCVyyh3MtBZYOtBZJ0uDyFdKSpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpIrhIEmqGA6SpMqwCYeImBMRP42IjRGxqOl6JGkkGxbhEBEHA18AzgROBOZFxInNViVJI9ewCAdgGrAxM3+emS8Ay4C5DdckSSPWcAmHCcDmtv2e0iZJakBkZtM1EBHnAnMy8wNl//3AqZn5od3GLQAWlN0/An46pIXW
jgX+o+Eahgsfi118LHbxsdhlODwWx2dmV38GHrKvK+mnLcCktv2Jpe0VMvM64LqhKmpvImJNZnY3Xcdw4GOxi4/FLj4Wu+xvj8VwmVZaDUyNiCkRcShwPrCi4ZokacQaFlcOmbkjIj4E3AkcDCzNzEcbLkuSRqxhEQ4AmXkbcFvTdbxGw2aKaxjwsdjFx2IXH4td9qvHYljckJYkDS/D5Z6DJGkYMRwkSRXDQZJUMRwkSRXDYRBExO1N19C0iPhg0zU0ISJ+PyKWRMQXImJcRPxdRDwcETdHxOuarm+oRMSHIuLYsn1CRHw/Ip6OiPsj4o1N1zccRMSfNV3DazFslrIOdxFxyp66gJOHspamRcTf7t4EXBoRowEy87NDX1VjvgbcChwB3At8A3gXcDbw/xg5byD5V5n5j2X7c8DVmfmdiDiN1uPwjsYqa0BEvGf3JuALEXEIQGZ+e+irem0Mh/5bDXyP1n/k3R01xLU07ZO0XpPyKLsej4OBsY1V1JzxmXkttK6eMvPTpf3aiLi4wbqGWvv/S34vM78DkJn3RcRI/Lu4idaLep9k13PkCOB/AQkYDgeQ9cBfZuaG3TsiYnMf4w9kbwD+gdYf+ycz87mImJ+Zn2y4ria0T83euFvfwUNZSMOWR8TXgMuB70TE3wDfAd4JPN5kYQ15O3AlsDozlwBExGmZeWGzZfWf9xz67+/Y8+P14SGso3GZ+Xhmvhf4N2BleVfdkeqWiBgDkJkf39kYESfQ/LsGD5nMvIzWlfU3gb8FPgXcDkwF/rzB0hqRmauB/wkcGhH3RsQ0WlcM+w1fIf0aRMQfAu+h9Q6yLwE/A/4pM3/daGENiogjaAXnqZk5s+FyGhERf0zr80fuz8z/bGufk5l3NFdZsyLi65n5/qbraFpEHAdcA3Rn5h82XU9/GQ79FBF/Dbwb+D6tG44PAk8D5wAfzMz7mqtOTYmIDwMfojXteDLwkcy8pfStzcw9LWQ4oEREX++i/E7gHoDM3K9W6shw6LeIeBg4OTNfiojDgdsy87SI+APglsx8S8MlDpmI+F3gUlqfu3F7Zv5TW98XM3PELGstfxd/kpn/GRGTgeXA1zPzcxHx4Ej5u4iItcA64Cu0pk+C1hTT+QCZ+b3mqht6EfH7wGLgt8AnaE09vwf4Ca1/QGxtsLx+8Z7Da7PzBv5hwM555seBUY1V1Iyv0nryfws4PyK+FRGHlb7pzZXViIN2TiVl5ibgNODMiPgsfa9sO1B1Aw8AlwHPlCvp/8rM7420YCi+RissN9Na4vxfwFnAKlpLe4c9w6H/vgKsjogvAz8AvgAQEV1Ab5OFNeB/ZOaizPxumS5YC9wTEeOaLqwBv4qIl1/nUoLi3bQ+EnLEvPgrM3+bmVcDFwKXRcQ/MrJXQ47PzGsz80rgqMz8dGZuLsuej2+6uP4Yyf/xXpMyTfAvwOuBf8jMn5T2bcBIuxF7WEQclJm/BcjMKyJiC637MWOaLW3IXQDsaG/IzB3ABRHxpWZKak5m9gDvjYizgBG7UIMDYImz9xz0mkXEVcBdmfkvu7XPAa7NzKnNVCYNDxFxOXBV++q10n4CcGVmDvvl34aDBlVEXJiZX226Dmm42l+eI4aDBlVEPJ6Zf9B0HdJwtb88R7znoNcsIn68py5g/FDWIg1HB8JzxHDQQIwHZgNP7dYetN5SQxrp9vvniOGggfhnYExm/mj3joi4b+jLkYad/f454j0HSVLFF8FJkiqGgySpYjhIkiqGgySpYjhIkir/H+DWFFYWxutlAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "df.sort_values(by=['Views'], ascending=False)[['Views']].head(5).plot.bar()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/scripts/1.python_wrap_lines.py b/scripts/1.python_wrap_lines.py new file mode 100644 index 0000000..2db7976 --- /dev/null +++ b/scripts/1.python_wrap_lines.py @@ -0,0 +1,42 @@ +import os + +size = 80 +file = 'budo' +folder = os.path.expanduser('~/Documents/Fortunes/') + +# Read and store the entire file line by line +with open(f'{folder}{file}.txt') as reader: + provers = reader.readlines() + +# wrap/collate lines by separators [",", " ", "."] +def collate(text, size): + new_text = [] + split_char = 1 + while split_char > 0: + comma = str.find(text, ',', size) + space = str.find(text, ' ', size) + dot = str.find(text, '.', size) + + split_char = min(max(comma, dot), max(comma, space), max(dot, space)) + + if text[:split_char]: + new_text.append(text[:split_char]) + text = text[split_char+1:].replace('\n', "") + + return new_text + +# write collated information to new(same) file +with open(f'{folder}{file}.txt', 'w') as writer: + for wisdom in provers: + if len(wisdom) > size: + collated = collate(wisdom, size) + for short in collated: + 
writer.write(short) + writer.write('\n') + else: + writer.write(wisdom) + +# Executing Shell Commands with Python +import os +myCmd = f'strfile -c % {folder}{file}.txt {folder}{file}.txt.dat' +os.system(myCmd) \ No newline at end of file diff --git a/scripts/__init__.py b/scripts/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/test.py b/test.py index 75d9766..9515637 100644 --- a/test.py +++ b/test.py @@ -1 +1,5 @@ -print('hello world') +import urllib.parse + +f = '25 Pandas Create A Matplotlib Scatterplot From A Dataframe ' +ff = urllib.parse.quote_plus(f) +print(ff.replace('+', '_')) \ No newline at end of file