This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from Mining the Social Web (3rd Edition). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.
In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, you can find the full source code repository here.
You are free to use or adapt this notebook for any purpose you'd like. However, please respect the Simplified BSD License that governs its use.
As you work through these notebooks, it's assumed that you'll be executing each cell in turn, because some cells will define variables that cells below them will use. Here's a very simple example to illustrate how this all works...
# Execute this cell to define this variable
# Either click Cell => Run from the menu or type
# ctrl-Enter to execute. See the Help menu for lots
# of useful tips. Help => IPython Help and
# Help => Keyboard Shortcuts are especially
# useful.
message = "I want to mine the social web!"
# The variable 'message' is defined here. Execute this cell to see for yourself
print(message)
# The variable 'message' is defined here, but we'll delete it
# after displaying it to illustrate an important point...
print(message)
del message
# The variable 'message' is no longer defined, either in this cell or in the
# cell two cells above. Try executing this cell or that cell to see for yourself.
print(message)
# Try typing in some code of your own!
This section of the notebook introduces a few Python idioms that are used widely throughout the book and that you might find helpful to review. This section is not intended to be a Python tutorial. It is intended to highlight some of the fundamental aspects of Python that will help you to follow along with the source code, assuming you have a general programming background. Sections 1 through 8 of the Python Tutorial are what you should spend a couple of hours working through if you are looking for a gentle introduction to Python as a programming language.
If you come from a web development background, a good starting point for understanding Python data structures is to use JSON as a reference point. If you don't have a web development background, think of JSON as a simple but expressive specification for representing arbitrary data structures using strings, numbers, lists, and dictionaries. Execute the following cell, which illustrates these fundamental data types, to follow along.
an_integer = 23
print(an_integer, type(an_integer))
print()
a_float = 23.0
print(a_float, type(a_float))
print()
a_string = "string"
print(a_string, type(a_string))
print()
a_list = [1,2,3]
print(a_list, type(a_list))
print(a_list[0]) # access the first item
print()
a_dict = {'a' : 1, 'b' : 2, 'c' : 3}
print(a_dict, type(a_dict))
print(a_dict['a']) # access the item with key 'a'
Assuming you've followed along with these fundamental data types, consider the possibilities for arbitrarily composing them to represent more complex structures:
contacts = [
    {
        'name' : 'Bob',
        'age' : 23,
        'married' : False,
        'height' : 1.8, # meters
        'languages' : ['English', 'Spanish'],
        'address' : '123 Maple St.',
        'phone' : '(555) 555-5555'
    },
    {
        'name' : 'Sally',
        'age' : 26,
        'married' : True,
        'height' : 1.5, # meters
        'languages' : ['English'],
        'address' : '456 Elm St.',
        'phone' : '(555) 555-1234'
    }
]
for contact in contacts:
    print("Name:", contact['name'])
    print("Married:", contact['married'])
    print()
As alluded to previously, the data structures very much lend themselves to constructing JSON in a very natural way. This is often quite convenient for web application development that involves using a Python server process to send data back to a JavaScript client. The following cell illustrates the general idea.
import json
print(contacts)
print(type(contacts)) # list
# json.dumps (pronounced "dump string") takes a Python data structure
# that is serializable to JSON and dumps it as a string
jsonified_contacts = json.dumps(contacts, indent=2) # indent is used for pretty-printing
print(type(jsonified_contacts)) # str
print(jsonified_contacts)
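The reverse direction is handled by json.loads ("load string"). The following sketch, which uses a hypothetical record modeled on the contacts above, round-trips a dictionary through JSON and shows one caveat worth knowing: JSON has no tuple type, so tuples come back as lists.

```python
import json

# Hypothetical record, modeled on the contacts list used earlier
contact = {
    'name': 'Bob',
    'age': 23,
    'married': False,
    'languages': ['English', 'Spanish'],
}

serialized = json.dumps(contact)   # Python dict -> JSON string
restored = json.loads(serialized)  # JSON string -> Python dict

print(type(serialized))
print(restored == contact)

# Tuples are written as JSON arrays, so they are restored as lists
round_tripped = json.loads(json.dumps({'point': (1, 2)}))
print(round_tripped)
```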
A couple of additional types that you'll run across regularly are tuples and the special None type. Think of a tuple as an immutable list and None as a special value that indicates an empty value, which is neither True nor False.
a_tuple = (1,2,3)
an_int = (1) # You must include a trailing comma when only one item is in the tuple
a_tuple = (1,)
a_tuple = (1,2,3,) # Trailing commas are ok in tuples and lists
none = None
print(none == None) # True
print(none == True) # False
print(none == False) # False
print()
# In general, you'll see the special 'is' operator used when comparing a value to
# None, but most of the time, it works the same as '=='
print(none is None) # True
print(none is True) # False
print(none is False) # False
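One case where 'is' and '==' can differ is when a class customizes equality. The sketch below uses a hypothetical class, AlwaysEqual, purely to illustrate why 'is None' is generally the safer test: it checks identity with the None singleton and does not consult the object's own __eq__ method.

```python
class AlwaysEqual:
    """Hypothetical class whose instances claim equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

equal_to_none = (obj == None)  # Calls obj.__eq__(None), which returns True
is_none = (obj is None)        # Identity check; obj is not the None object

print(equal_to_none)
print(is_none)
```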
As indicated in the python.org tutorial, None is often used as a default value for function parameters. Functions are defined with the keyword def.
def square(x):
    return x*x
print(square(2)) # 4
print()
# The default value for L is only created once and shared amongst
# calls
def f1(a, L=[]):
    L.append(a)
    return L
print(f1(1)) # [1]
print(f1(2)) # [1, 2]
print(f1(3)) # [1, 2, 3]
print()
# Each call creates a new value for L
def f2(a, L=None):
    if L is None:
        L = []
    L.append(a)
    return L
print(f2(1)) # [1]
print(f2(2)) # [2]
print(f2(3)) # [3]
For lists and strings, you'll often want to extract a particular selection using a starting and ending index. In Python, this is called slicing. The syntax involves using square brackets in the same way that you are extracting a single value, but you include an additional parameter to indicate the boundary for the slice.
a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(a_list[0]) # a
print(a_list[0:2]) # ['a', 'b']
print(a_list[:2]) # Same as above. The starting index is implicitly 0
print(a_list[3:]) # ['d', 'e', 'f', 'g'] Ending index is implicitly the length of the list
print(a_list[-1]) # g Negative indices start at the end of the list
print(a_list[-3:-1]) # ['e', 'f'] Start at the end and work backwards. (The index after the colon is still excluded)
print(a_list[-3:]) # ['e', 'f', 'g'] The last three items in the list
print(a_list[:-4]) # ['a', 'b', 'c'] # Everything up to the last 4 items
a_string = 'abcdefg'
# String slicing works the very same way
print(a_string[:-4]) # abc
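Slices also accept an optional third value, the step, which is not shown in the cell above. This minimal sketch (the variable names are illustrative only) shows a few common uses of it.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

every_other = letters[::2]        # Every second item, starting at index 0
middle = letters[1:6:2]           # Indices 1, 3, and 5
reversed_letters = letters[::-1]  # A negative step walks the list backwards
shallow_copy = letters[:]         # Omitting both bounds copies the whole list

print(every_other)
print(middle)
print(reversed_letters)
print(shallow_copy is letters)
```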
Think of Python's list comprehension idiom as a concise and efficient way to create lists. You'll often see list comprehensions used as an alternative to for loops for a common set of problems. Although they may take some getting used to, you'll soon find them to be a natural expression. See the section entitled "Loops" from Python Performance Tips for details on why list comprehensions may perform better than loops or functions like map in various situations.
# One way to create a list containing 0..9:
a_list = []
for i in range(10):
    a_list.append(i)
print(a_list)
# How to do it with a list comprehension
print([ i for i in range(10) ])
# But what about a nested loop like this one, which
# even contains a conditional expression in it:
a_list = []
for i in range(10):
    for j in range(10, 20):
        if i % 2 == 0:
            a_list.append(i)
print(a_list)
# You can use a nested list comprehension to
# achieve the very same result. When written with readable
# indentation like below, note the striking similarity to
# the equivalent code as presented above.
print([ i
        for i in range(10)
            for j in range(10, 20)
                if i % 2 == 0
])
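Since the text mentions map as an alternative, here is a small side-by-side sketch of the same computation (squares of the even numbers below 10) written both ways. The variable names are illustrative; which form reads better is largely a matter of taste.

```python
numbers = range(10)

# List comprehension: the filter and the transformation sit in one expression
with_comprehension = [n * n for n in numbers if n % 2 == 0]

# Equivalent version using the built-in filter and map functions
with_map_filter = list(map(lambda n: n * n,
                           filter(lambda n: n % 2 == 0, numbers)))

print(with_comprehension)
print(with_map_filter)
```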
In the same way that you can concisely construct lists with list comprehensions, you can concisely construct dictionaries with dictionary comprehensions. The underlying concept and the syntax are very similar to those of list comprehensions. The following example illustrates a few different ways to create the same dictionary and introduces dictionary construction syntax.
# Literal syntax
a_dict = { 'a' : 1, 'b' : 2, 'c' : 3 }
print(a_dict)
print()
# Using the dict constructor
a_dict = dict([('a', 1), ('b', 2), ('c', 3)])
print(a_dict)
print()
# Dictionary comprehension syntax
a_dict = { k : v for (k,v) in [('a', 1), ('b', 2), ('c', 3)] }
print(a_dict)
print()
# A more appropriate circumstance to use dictionary comprehension would
# involve more complex computation
a_dict = { k : k*k for k in range(10) } # {0: 0, 1: 1, 2: 4, 3: 9, ..., 9: 81}
print(a_dict)
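Dictionary comprehensions are also handy for filtering or inverting an existing dictionary. The following sketch uses a hypothetical word_counts dictionary; note that inverting only works cleanly when the values are unique and hashable.

```python
# Hypothetical data for illustration
word_counts = {'social': 3, 'web': 5, 'mining': 1}

# Keep only the entries whose count is at least 3
frequent = {word: count for word, count in word_counts.items() if count >= 3}

# Swap keys and values (assumes the counts are unique)
inverted = {count: word for word, count in word_counts.items()}

print(frequent)
print(inverted)
```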
While iterating over a collection such as a list, it's often handy to know the index for the item that you are looping over in addition to its value. While a reasonable approach is to maintain a looping index, the enumerate function spares you the trouble.
lst = ['a', 'b', 'c']
# You could opt to maintain a looping index...
i = 0
for item in lst:
    print(i, item)
    i += 1
# ...but the enumerate function spares you the trouble of maintaining a loop index
for i, item in enumerate(lst):
    print(i, item)
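Two small variations on enumerate that may come in handy: its optional start argument changes the first index, and combining it with a comprehension gives a quick lookup from item to position. This is only a sketch with illustrative names.

```python
lst = ['a', 'b', 'c']

# Count from 1 instead of 0
numbered = list(enumerate(lst, start=1))

# Build a lookup table from each item to its position in the list
positions = {item: i for i, item in enumerate(lst)}

print(numbered)
print(positions)
```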
Conceptually, Python functions accept lists of positional arguments that can be followed by additional keyword arguments. A common idiom that you'll see when calling functions is to unpack a list or dictionary with the asterisk or double-asterisk, respectively, a special trick for satisfying the function's parameterization.
def f(a, b, c, d=None, e=None):
    print(a, b, c, d, e)
f(1, 2, 3) # 1 2 3 None None
f(1, 2, 3, d=4) # 1 2 3 4 None
f(1, 2, 3, d=4, e=5) # 1 2 3 4 5
args = [1,2,3]
kwargs = {'d' : 4, 'e' : 5}
f(*args, **kwargs) # 1 2 3 4 5
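The same asterisk syntax works on the receiving side of a call. In this sketch, a hypothetical function named describe collects any positional arguments into a tuple and any keyword arguments into a dictionary, which is the mirror image of the unpacking shown above.

```python
def describe(*args, **kwargs):
    """Hypothetical helper that returns whatever arguments it was given."""
    # args is a tuple of positional arguments; kwargs is a dict of keyword arguments
    return args, kwargs


positional, keyword = describe(1, 2, 3, d=4, e=5)

print(positional)
print(keyword)
```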
It's often clearer in code to use string substitution than to concatenate strings, although both options can get the job done. The string type's built-in format method is also very handy and adds to the readability of code. The following examples illustrate some of the common string substitutions that you'll regularly encounter in the code.
name1, name2 = "Bob", "Sally"
print("Hello, " + name1 + ". My name is " + name2)
print("Hello, %s. My name is %s" % (name1, name2,))
print("Hello, {0}. My name is {1}".format(name1, name2))
print("Hello, {0}. My name is {1}".format(*[name1, name2]))
names = [name1, name2]
print("Hello, {0}. My name is {1}".format(*names))
print("Hello, {you}. My name is {me}".format(you=name1, me=name2))
print("Hello, {you}. My name is {me}".format(**{'you' : name1, 'me' : name2}))
names = {'you' : name1, 'me' : name2}
print("Hello, {you}. My name is {me}".format(**names))
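If you are running Python 3.6 or later, formatted string literals (f-strings) offer another option that the examples above don't cover. The sketch below assumes that Python version; the height value is made up for illustration.

```python
name1, name2 = "Bob", "Sally"
height = 1.8  # Hypothetical value, in meters

# Expressions inside the braces are evaluated in place
greeting = f"Hello, {name1}. My name is {name2}"

# Format specifiers work the same way as with str.format
formatted = f"{name1} is {height:.2f} meters tall"

print(greeting)
print(formatted)
```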
Although you're probably better off configuring and using an SSH client to log in to the virtual machine most of the time, you can also interact with the virtual machine almost as though you were working in a terminal session by using Jupyter Notebook and the envoy package, which wraps the subprocess package in a convenient way that lets you run arbitrary commands and see the results. The following script shows how to run a few remote commands on the virtual machine (where this Jupyter Notebook server would be running if you are using the virtual machine). Even in situations where you are running a Python program locally, this package can be a significant convenience.
import envoy # pip install envoy
# Run a command just as you would in a terminal
r = envoy.run('ps aux | grep jupyter') # show processes containing 'jupyter'
# Print its standard output
print(r.std_out)
# Print its standard error
print(r.std_err)
# Print the working directory for the IPython Notebook server
print(envoy.run('pwd').std_out)
# Try some commands of your own...
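If envoy isn't available in your environment, the standard library's subprocess module can do similar work directly. This is a swapped-in alternative rather than the approach the notebook uses, and it assumes Python 3.7 or later for the capture_output argument. To keep the sketch portable, it launches the current Python interpreter instead of a shell command.

```python
import subprocess
import sys

# Run a child process and capture its output as text
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a subprocess")'],
    capture_output=True,
    text=True,
)

print(result.stdout)      # Standard output of the child process
print(result.stderr)      # Standard error of the child process
print(result.returncode)  # Exit status; 0 conventionally means success
```

Unlike envoy.run, passing a list of arguments does not go through a shell, so pipes such as 'ps aux | grep jupyter' would need shell=True or separate handling.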
An alternative to using the envoy package to interact with the virtual machine is what is called "Bash Cell Magic" in IPython Notebook. If you write %%bash on the first line of a cell, IPython Notebook will automatically take the remainder of the cell and execute it as a Bash script on the machine where the server is running. In case you come from a Windows background or are a Mac user who hasn't yet encountered Bash, it's the name of the default shell on most Linux systems, including the virtual machine that runs the IPython Notebook server.
Assuming that you are using the virtual machine, this means that you can essentially write Bash scripts in IPython Notebook cells and execute them on the server. The following script demonstrates some of the possibilities, including how to use a command like wget to download a file.
%%bash
# Print the working directory
pwd
# Display the date
date
# View the first 10 lines of a manual page for wget
man wget | head -10
# Download a webpage to /tmp/foo.html
wget -qO /tmp/foo.html http://ipython.org/notebook.html
# Search for 'ipython' in the webpage
grep ipython /tmp/foo.html
Since Bash cell magic works just as though you were executing commands in a terminal, you can use it to easily manage your source code by executing commands like "git status" and "git pull".
%%bash
ls ../
# Displays the status of the local repository
git status
# Execute "git pull" to perform an update
IPython Notebook has some handy features for interacting with the web browser that you should know about. A few of the features that you'll see in the source code are embedding inline frames, and serving static content such as images, text files, JavaScript files, etc. The ability to serve static content is especially handy if you'd like to display an inline visualization for analysis, and you'll see this technique used throughout the notebook.
The following cell illustrates creating and embedding an inline frame and serving the static source file for this notebook, which is serialized as JSON data.
from IPython.display import IFrame, display
# IPython Notebook can serve files relative to the location of
# the working notebook into inline frames. Prepend the path
# with the 'files' prefix
static_content = 'resources/appc-pythontips/hello.txt'
display(IFrame(static_content, '100%', '600px'))
The Docker container maps the top level directory of your GitHub checkout (the directory containing README.md) on your host machine to its /home/jovyan/notebooks folder and automatically synchronizes files between the guest and host environments as a convenience to you. This mapping and synchronization enables Jupyter Notebooks you are running on the guest machine to access files that you can conveniently manage on your host machine and vice versa. For example, many of the scripts in Jupyter Notebooks may write out data files, and you can easily access those data files on your host environment (should you desire to do so) without needing to connect into the virtual machine with an SSH session. On the flip side, you can provide data files to Jupyter Notebook, which is running on the guest machine, by copying them anywhere into your top level GitHub checkout.
In effect, the top level directory of your GitHub checkout is automatically synchronized between the guest and host environments so that you have access to everything that is happening and can manage your source code, modified notebooks, and everything else all from your host machine. See docker-compose.yml for more details on how synchronized folders can be configured.
The following code snippet illustrates how to access files. Keep in mind that the code that you execute in this cell writes data to the Docker container, and it's Docker that automatically synchronizes it back to your host environment. It's a subtle but important detail.
import os
# The absolute path to the shared folder on the VM
shared_folder="/home/jovyan/notebooks"
# List the files in the shared folder
print(os.listdir(shared_folder))
print()
# How to read and display a snippet of the README.md file in the shared folder...
README = os.path.join(shared_folder, "README.md")
with open(README) as f:
    txt = f.read()
print(txt[:200])
# Write out a file to the guest but notice that it is available on the host
# by checking the contents of your GitHub checkout
with open(os.path.join(shared_folder, "Hello.txt"), "w") as f:
    f.write("Hello. This text is written on the guest but synchronized to the host by Docker")
You are free to use or adapt this notebook for any purpose you'd like. However, please respect the following Simplified BSD License (also known as "FreeBSD License") that governs its use. Basically, you can do whatever you want with the code so long as you retain the copyright notice.
Copyright (c) 2018, Matthew A. Russell & Mikhail Klassen All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.