Mining the Social Web, 3rd Edition

Appendix C: Python and Jupyter Notebook Tips & Tricks

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from Mining the Social Web (3rd Edition). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.

In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, you can find the full source code repository here.

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the Simplified BSD License that governs its use.

Try tinkering around with Python to test out Jupyter Notebook...

As you work through these notebooks, it's assumed that you'll be executing each cell in turn, because some cells will define variables that cells below them will use. Here's a very simple example to illustrate how this all works...

In [ ]:
# Execute this cell to define this variable
# Either click Cell => Run from the menu or type
# ctrl-Enter to execute. See the Help menu for lots
# of useful tips. Help => IPython Help and
# Help => Keyboard Shortcuts are especially
# useful.

message = "I want to mine the social web!"
In [ ]:
# The variable 'message' is defined here. Execute this cell to see for yourself
print(message)
In [ ]:
# The variable 'message' is defined here, but we'll delete it
# after displaying it to illustrate an important point...
print(message)
del message
In [ ]:
# The variable 'message' is no longer defined, either in this cell or in the
# cell two above it. Try executing this cell or that cell to see for yourself.
print(message)
In [ ]:
# Try typing in some code of your own!
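
If you'd like to see what happens without interrupting your session, here is a minimal sketch that catches the error Python raises when a name has not been defined yet. The variable name undefined_message is hypothetical and chosen only so that it does not clash with anything defined above.

```python
# Referencing a name that was never defined (or was deleted) raises NameError
try:
    print(undefined_message)
except NameError as err:
    error_text = str(err)
    print("Caught a NameError:", error_text)
```

The same kind of error appears if you run a cell before the cell that defines the variables it uses.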

Python Idioms

This section of the notebook introduces a few Python idioms that are used widely throughout the book and that you might find helpful to review. It is not intended to be a Python tutorial; rather, it highlights some of the fundamental aspects of Python that will help you follow along with the source code, assuming you have a general programming background. If you are looking for a gentle introduction to Python as a programming language, spend a couple of hours working through Sections 1 through 8 of the Python Tutorial.

Python Data Structures are like JSON

If you come from a web development background, a good starting point for understanding Python data structures is to use JSON as a reference point. If you don't have a web development background, think of JSON as a simple but expressive specification for representing arbitrary data structures using strings, numbers, lists, and dictionaries. Execute the following cell, which illustrates these fundamental data types, to follow along.

In [ ]:
an_integer = 23
print(an_integer, type(an_integer))
print()

a_float = 23.0
print(a_float, type(a_float))
print()

a_string = "string"
print(a_string, type(a_string))
print()

a_list = [1,2,3]
print(a_list, type(a_list))
print(a_list[0]) # access the first item
print()

a_dict = {'a' : 1, 'b' : 2, 'c' : 3}
print(a_dict, type(a_dict))
print(a_dict['a']) # access the item with key 'a'

Assuming you've followed along with these fundamental data types, consider the possibilities for arbitrarily composing them to represent more complex structures:

In [ ]:
contacts = [
    {
      'name'      : 'Bob',
      'age'       : 23,
      'married'   : False,
      'height'    : 1.8, # meters
      'languages' : ['English', 'Spanish'],
      'address'   : '123 Maple St.',
      'phone'     : '(555) 555-5555'
    },
    
    {'name'      : 'Sally',
     'age'       : 26,
     'married'   : True,
     'height'    : 1.5, # meters
     'languages' : ['English'],
     'address'   : '456 Elm St.',
     'phone'     : '(555) 555-1234'
    }              
]

for contact in contacts:
    print("Name:", contact['name'])
    print("Married:", contact['married'])
    print()

As alluded to previously, the data structures very much lend themselves to constructing JSON in a very natural way. This is often quite convenient for web application development that involves using a Python server process to send data back to a JavaScript client. The following cell illustrates the general idea.

In [ ]:
import json

print(contacts)
print(type(contacts)) # list

# json.dumps ("dumps" stands for "dump string") takes a Python data structure
# that is serializable to JSON and returns it as a string
jsonified_contacts = json.dumps(contacts, indent=2) # indent is used for pretty-printing

print(type(jsonified_contacts)) # str
print(jsonified_contacts)
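
Going the other direction is just as easy. The following sketch uses json.loads ("load string") to parse a JSON string back into Python data structures. The sample_contacts list is a small hypothetical stand-in for the contacts list so that the cell can run on its own.

```python
import json

# A small stand-in for the contacts list (hypothetical data)
sample_contacts = [
    {'name': 'Bob', 'age': 23, 'married': False, 'languages': ['English', 'Spanish']},
]

# Serialize to a JSON string, then parse it back into Python objects
as_json = json.dumps(sample_contacts)
restored = json.loads(as_json)

print(type(as_json))                # <class 'str'>
print(type(restored))               # <class 'list'>
print(restored == sample_contacts)  # True

# Python's False is written as JSON's lowercase false in the string
print(json.dumps({'married': False}))  # {"married": false}
```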

A couple of additional types that you'll run across regularly are tuples and the special None type. Think of a tuple as an immutable list and None as a special value that indicates an empty value. None is equal to neither True nor False, although it is treated as false in a boolean context such as an if statement.

In [ ]:
a_tuple = (1,2,3)

an_int = (1) # You must include a trailing comma when only one item is in the tuple

a_tuple = (1,)

a_tuple = (1,2,3,) # Trailing commas are ok in tuples and lists 

none = None

print(none == None)   # True
print(none == True)   # False
print(none == False)  # False

print()

# In general, you'll see the special 'is' operator used when comparing a value to 
# None, but most of the time, it works the same as '=='

print(none is None)  # True
print(none is True)  # False
print(none is False) # False
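
To see one case where 'is' and '==' differ, consider the following sketch. The AlwaysEqual class is hypothetical and exists only to illustrate that '==' can be customized by a class, whereas 'is' always checks whether two names refer to the very same object.

```python
class AlwaysEqual:
    """A hypothetical class whose instances claim to equal everything."""
    def __eq__(self, other):
        return True

obj = AlwaysEqual()

print(obj == None)  # True, because the custom __eq__ says so
print(obj is None)  # False, because obj is not the None object

# None is falsy in a boolean context even though it is not equal to False
print(bool(None))     # False
print(None == False)  # False
```

This is the main reason that comparisons against None are conventionally written with 'is'.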

As indicated in the python.org tutorial, None is often used as a default value for function parameters. Functions are defined with the keyword def.

In [ ]:
def square(x):
    return x*x

print(square(2)) # 4
print()

# The default value for L is only created once and shared amongst
# calls

def f1(a, L=[]):
    L.append(a)
    return L

print(f1(1)) # [1]
print(f1(2)) # [1, 2]
print(f1(3)) # [1, 2, 3]
print()

# Each call creates a new value for L

def f2(a, L=None):
    if L is None:
        L = []
    L.append(a)
    return L

print(f2(1)) # [1]
print(f2(2)) # [2]
print(f2(3)) # [3]

List and String Slicing

For lists and strings, you'll often want to extract a particular selection using a starting and ending index. In Python, this is called slicing. The syntax uses square brackets in the same way as when you extract a single value, except that you provide a starting index and an ending index separated by a colon. The item at the starting index is included, and the item at the ending index is excluded.

In [ ]:
a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

print(a_list[0])     # a
print(a_list[0:2])   # ['a', 'b']
print(a_list[:2])    # Same as above. The starting index is implicitly 0
print(a_list[3:])    # ['d', 'e', 'f', 'g'] Ending index is implicitly the length of the list
print(a_list[-1])    # g Negative indices start at the end of the list
print(a_list[-3:-1]) # ['e', 'f'] From the third-to-last item up to (but excluding) the last item
print(a_list[-3:])   # ['e', 'f', 'g'] The last three items in the list
print(a_list[:-4])   # ['a', 'b', 'c'] Everything up to the last 4 items

a_string = 'abcdefg'

# String slicing works the very same way

print(a_string[:-4]) # abc
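
Slices also accept an optional third value, the step, which is not shown in the cell above. The following is a small sketch of how it behaves; the variable name letters is just for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

print(letters[::2])     # ['a', 'c', 'e', 'g'] Every second item
print(letters[1:6:2])   # ['b', 'd', 'f'] Every second item from index 1 up to (but excluding) 6
print(letters[::-1])    # ['g', 'f', 'e', 'd', 'c', 'b', 'a'] A reversed copy
print('abcdefg'[::-1])  # gfedcba

# Slices that run past the end of the list are truncated rather than raising an error
print(letters[5:100])   # ['f', 'g']
```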

List Comprehensions

Think of Python's list comprehension idiom as a concise and efficient way to create lists. You'll often see list comprehensions used as an alternative to for loops for a common set of problems. Although they may take some getting used to, you'll soon find them to be a natural expression. See the section entitled "Loops" from Python Performance Tips for more details on why list comprehensions may be more performant than loops or functions like map in various situations.

In [ ]:
# One way to create a list containing 0..9:

a_list = []
for i in range(10):
    a_list.append(i)
print(a_list)    
    
# How to do it with a list comprehension

print([ i for i in range(10) ])


# But what about a nested loop like this one, which
# even contains a conditional expression in it:

a_list = []
for i in range(10):
    for j in range(10, 20):
        if i % 2 == 0:
            a_list.append(i)

print(a_list)

# You can use a nested list comprehension to
# achieve the very same result. When written with readable
# indentation like below, note the striking similarity to
# the equivalent code as presented above.

print([ i
        for i in range(10)
            for j in range(10, 20)
                if i % 2 == 0
      ])
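
List comprehensions are not limited to copying values. As a brief sketch, the following cell transforms and filters items in a single expression and flattens a nested list. The sample data is made up for illustration.

```python
words = ['mining', 'the', 'social', 'web']

# Transform each item
lengths = [len(w) for w in words]
print(lengths)     # [6, 3, 6, 3]

# Transform and filter at the same time
long_words = [w.upper() for w in words if len(w) > 3]
print(long_words)  # ['MINING', 'SOCIAL']

# Flatten a list of lists. The for clauses read in the same
# order as the equivalent nested for loops would
nested = [[1, 2], [3, 4], [5]]
flat = [x for row in nested for x in row]
print(flat)        # [1, 2, 3, 4, 5]
```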

Dictionary Comprehensions

In the same way that you can concisely construct lists with list comprehensions, you can concisely construct dictionaries with dictionary comprehensions. The underlying concept and the syntax are very similar to those of list comprehensions. The following example illustrates a few different ways to create the same dictionary and introduces dictionary construction syntax.

In [ ]:
# Literal syntax

a_dict = { 'a' : 1, 'b' : 2, 'c' : 3 }
print(a_dict)
print()

# Using the dict constructor

a_dict = dict([('a', 1), ('b', 2), ('c', 3)])
print(a_dict)
print()

# Dictionary comprehension syntax

a_dict = { k : v for (k,v) in [('a', 1), ('b', 2), ('c', 3)] }
print(a_dict)
print()

# A more appropriate circumstance to use dictionary comprehension would 
# involve more complex computation

a_dict = { k : k*k for k in range(10) } # {0: 0, 1: 1, 2: 4, 3: 9, ..., 9: 81}
print(a_dict)
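
Dictionary comprehensions are also handy for filtering or reshaping an existing dictionary. Here is a minimal sketch that iterates over a dictionary's items; the word_counts data is hypothetical.

```python
word_counts = {'social': 3, 'web': 5, 'mining': 1}

# Keep only the entries whose count is at least 3
frequent = {word: count for word, count in word_counts.items() if count >= 3}
print(frequent)  # {'social': 3, 'web': 5}

# Swap keys and values (this assumes that the values are unique)
inverted = {count: word for word, count in word_counts.items()}
print(inverted)  # {3: 'social', 5: 'web', 1: 'mining'}
```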

Enumeration

While iterating over a collection such as a list, it's often handy to know the index for the item that you are looping over in addition to its value. While a reasonable approach is to maintain a looping index, the enumerate function spares you the trouble.

In [ ]:
lst = ['a', 'b', 'c']

# You could opt to maintain a looping index...
i = 0
for item in lst:
    print(i, item)
    i += 1

# ...but the enumerate function spares you the trouble of maintaining a loop index
for i, item in enumerate(lst):
    print(i, item)
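
If you'd prefer the count to begin somewhere other than zero, enumerate accepts an optional start argument. A short sketch:

```python
lst = ['a', 'b', 'c']

# Begin counting at 1 instead of 0
for position, item in enumerate(lst, start=1):
    print(position, item)

# enumerate yields (index, item) tuples, which you can collect into a list
numbered = list(enumerate(lst, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```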

*args and **kwargs

Conceptually, Python functions accept a list of positional arguments that can be followed by additional keyword arguments. A common idiom that you'll see when calling functions is to unpack a list or dictionary with the asterisk or double asterisk, respectively, which is a handy trick for satisfying the function's parameterization.

In [ ]:
def f(a, b, c, d=None, e=None):
    print(a, b, c, d, e)

f(1, 2, 3)              # 1 2 3 None None
f(1, 2, 3, d=4)         # 1 2 3 4 None
f(1, 2, 3, d=4, e=5)    # 1 2 3 4 5

args = [1,2,3]
kwargs = {'d' : 4, 'e' : 5}

f(*args, **kwargs)      # 1 2 3 4 5
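
The same asterisk syntax works in the other direction when defining a function, in which case it collects whatever arguments the caller passes. The following is a minimal sketch; the function names describe and call_with_logging are hypothetical and used only for illustration.

```python
def describe(*args, **kwargs):
    # args is a tuple of the positional arguments;
    # kwargs is a dict of the keyword arguments
    return args, kwargs

print(describe(1, 2, 3, d=4, e=5))  # ((1, 2, 3), {'d': 4, 'e': 5})

# A common use is a wrapper that forwards whatever it receives
def call_with_logging(func, *args, **kwargs):
    print("Calling", func.__name__, "with", args, kwargs)
    return func(*args, **kwargs)

result = call_with_logging(max, 3, 7, 5)
print(result)  # 7
```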

String Substitutions

It's often clearer in code to use string substitution than to concatenate strings, although both options can get the job done. The string type's built-in format method is also very handy and adds to the readability of code. The following examples illustrate some of the common string substitutions that you'll regularly encounter in the code.

In [ ]:
name1, name2 = "Bob", "Sally"

print("Hello, " + name1 + ". My name is " + name2)

print("Hello, %s. My name is %s" % (name1, name2,))

print("Hello, {0}. My name is {1}".format(name1, name2))
print("Hello, {0}. My name is {1}".format(*[name1, name2]))
names = [name1, name2]
print("Hello, {0}. My name is {1}".format(*names))


print("Hello, {you}. My name is {me}".format(you=name1, me=name2))
print("Hello, {you}. My name is {me}".format(**{'you' : name1, 'me' : name2}))
names = {'you' : name1, 'me' : name2}
print("Hello, {you}. My name is {me}".format(**names))
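
If you are running Python 3.6 or later, formatted string literals (often called f-strings) offer one more option that isn't shown above. This sketch assumes such a Python version; the height value is made up for illustration.

```python
name1, name2 = "Bob", "Sally"

# Prefix the string with f and reference variables directly inside the braces
greeting = f"Hello, {name1}. My name is {name2}"
print(greeting)   # Hello, Bob. My name is Sally

# Format specifications work the same way as with the format method
height = 1.8
with_fstring = f"{name1} is {height:.2f} meters tall"
with_format = "{0} is {1:.2f} meters tall".format(name1, height)
print(with_fstring)  # Bob is 1.80 meters tall
print(with_format)   # Bob is 1.80 meters tall
```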

Serving Static Content

IPython Notebook has some handy features for interacting with the web browser that you should know about. A few of the features that you'll see in the source code are embedding inline frames, and serving static content such as images, text files, JavaScript files, etc. The ability to serve static content is especially handy if you'd like to display an inline visualization for analysis, and you'll see this technique used throughout the notebook.

The following cell illustrates creating and embedding an inline frame and serving the static source file for this notebook, which is serialized as JSON data.

In [ ]:
from IPython.display import IFrame, display

# IPython Notebook can serve files relative to the location of
# the working notebook into inline frames. Prepend the path 
# with the 'files' prefix

static_content = 'resources/appc-pythontips/hello.txt'

display(IFrame(static_content, '100%', '600px'))

Shared Folders

The Docker container maps the top-level directory of your GitHub checkout (the directory containing README.md) on your host machine to its /home/jovyan/notebooks folder and automatically synchronizes files between the guest and host environments as an incredible convenience to you. This mapping and synchronization enables Jupyter Notebooks you are running on the guest machine to access files that you can conveniently manage on your host machine and vice versa. For example, many of the scripts in Jupyter Notebooks may write out data files, and you can easily access those data files on your host environment (should you desire to do so) without needing to connect to the guest itself. On the flip side, you can provide data files to Jupyter Notebook, which is running on the guest machine, by copying them anywhere into your top-level GitHub checkout.

In effect, the top level directory of your GitHub checkout is automatically synchronized between the guest and host environments so that you have access to everything that is happening and can manage your source code, modified notebooks, and everything else all from your host machine. See docker-compose.yml for more details on how synchronized folders can be configured.

The following code snippet illustrates how to access files. Keep in mind that the code that you execute in this cell writes data to the Docker container, and it's Docker that automatically synchronizes it back to your host environment. It's a subtle but important detail.

In [ ]:
import os

# The absolute path to the shared folder on the guest (the Docker container)
shared_folder="/home/jovyan/notebooks"

# List the files in the shared folder
print(os.listdir(shared_folder))
print()

# How to read and display a snippet of the README.md file...
README = os.path.join(shared_folder, "README.md")
with open(README) as f:
    txt = f.read()
print(txt[:200])

# Write out a file to the guest but notice that it is available on the host
# by checking the contents of your GitHub checkout
with open(os.path.join(shared_folder, "Hello.txt"), "w") as f:
    f.write("Hello. This text is written on the guest but synchronized to the host by Docker")
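
The shared folder path above only exists inside the Docker container. If you'd like to try the same write-then-read pattern anywhere, here is a sketch that uses a temporary directory as a stand-in for the shared folder (an assumption made purely so that the cell runs on any machine).

```python
import os
import tempfile

# A temporary directory stands in for the shared folder here.
# It is removed automatically when the with block ends
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Hello.txt")

    # The with statement closes the file for you, even if an error occurs
    with open(path, "w") as f:
        f.write("Hello from a temporary folder")

    with open(path) as f:
        contents = f.read()

    listing = os.listdir(folder)

print(listing)   # ['Hello.txt']
print(contents)  # Hello from a temporary folder
```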

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the following Simplified BSD License (also known as "FreeBSD License") that governs its use. Basically, you can do whatever you want with the code so long as you retain the copyright notice.

Copyright (c) 2018, Matthew A. Russell & Mikhail Klassen All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.
