% Note that the a4paper option is mainly intended so that authors in
% countries using A4 can easily print to A4 and see how their papers will
% look in print - the typesetting of the document will not typically be
% affected with changes in paper size (but the bottom and side margins will).
% Use the testflow package mentioned above to verify correct handling of
% both paper sizes by the user's LaTeX system.
%
% Also note that the "draftcls" or "draftclsnofoot", not "draft", option
% should be used if it is desired that the figures are to be displayed in
% draft mode.
%
\documentclass[conference]{IEEEtran}
\usepackage{tabularx}
%\usepackage{hyperref}
\usepackage{listings}
\usepackage{lstmisc}
\usepackage{flushend}
%\usepackage{rotating}
\usepackage[english]{babel}
% Add the compsoc option for Computer Society conferences.
% Some very useful LaTeX packages include:
% *** MISC UTILITY PACKAGES ***
%
%\usepackage{ifpdf}
% Heiko Oberdiek's ifpdf.sty is very useful if you need conditional
% compilation based on whether the output is pdf or dvi.
% usage:
% \ifpdf
% % pdf code
% \else
% % dvi code
% \fi
% The latest version of ifpdf.sty can be obtained from:
% http://www.ctan.org/tex-archive/macros/latex/contrib/oberdiek/
% Also, note that IEEEtran.cls V1.7 and later provides a builtin
% \ifCLASSINFOpdf conditional that works the same way.
% When switching from latex to pdflatex and vice-versa, the compiler may
% have to be run twice to clear warning/error messages.
% *** CITATION PACKAGES ***
%
%\usepackage{cite}
% cite.sty was written by Donald Arseneau
% V1.6 and later of IEEEtran pre-defines the format of the cite.sty package
% \cite{} output to follow that of IEEE. Loading the cite package will
% result in citation numbers being automatically sorted and properly
% "compressed/ranged". e.g., [1], [9], [2], [7], [5], [6] without using
% cite.sty will become [1], [2], [5]--[7], [9] using cite.sty. cite.sty's
% \cite will automatically add leading space, if needed. Use cite.sty's
% noadjust option (cite.sty V3.8 and later) if you want to turn this off.
% cite.sty is already installed on most LaTeX systems. Be sure and use
% version 4.0 (2003-05-27) and later if using hyperref.sty. cite.sty does
% not currently provide for hyperlinked citations.
% The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/cite/
% The documentation is contained in the cite.sty file itself.
% *** GRAPHICS RELATED PACKAGES ***
%
\ifCLASSINFOpdf
\usepackage[pdftex]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../pdf/}{../jpeg/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
\DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\else
% or other class option (dvipsone, dvipdf, if not using dvips). graphicx
% will default to the driver specified in the system graphics.cfg if no
% driver is specified.
\usepackage[dvips]{graphicx}
% declare the path(s) where your graphic files are
% \graphicspath{{../eps/}}
% and their extensions so you won't have to specify these with
% every instance of \includegraphics
\DeclareGraphicsExtensions{.eps}
\fi
% graphicx was written by David Carlisle and Sebastian Rahtz. It is
% required if you want graphics, photos, etc. graphicx.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/graphics/
% Another good source of documentation is "Using Imported Graphics in
% LaTeX2e" by Keith Reckdahl which can be found as epslatex.ps or
% epslatex.pdf at: http://www.ctan.org/tex-archive/info/
%
% latex, and pdflatex in dvi mode, support graphics in encapsulated
% postscript (.eps) format. pdflatex in pdf mode supports graphics
% in .pdf, .jpeg, .png and .mps (metapost) formats. Users should ensure
% that all non-photo figures use a vector format (.eps, .pdf, .mps) and
% not a bitmapped formats (.jpeg, .png). IEEE frowns on bitmapped formats
% which can result in "jaggedy"/blurry rendering of lines and letters as
% well as large increases in file sizes.
%
% You can find documentation about the pdfTeX application at:
% http://www.tug.org/applications/pdftex
% *** MATH PACKAGES ***
%
%\usepackage[cmex10]{amsmath}
% A popular package from the American Mathematical Society that provides
% many useful and powerful commands for dealing with mathematics. If using
% it, be sure to load this package with the cmex10 option to ensure that
% only type 1 fonts will utilized at all point sizes. Without this option,
% it is possible that some math symbols, particularly those within
% footnotes, will be rendered in bitmap form which will result in a
% document that can not be IEEE Xplore compliant!
%
% Also, note that the amsmath package sets \interdisplaylinepenalty to 10000
% thus preventing page breaks from occurring within multiline equations. Use:
%\interdisplaylinepenalty=2500
% after loading amsmath to restore such page breaks as IEEEtran.cls normally
% does. amsmath.sty is already installed on most LaTeX systems. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/amslatex/math/
% *** SPECIALIZED LIST PACKAGES ***
%
%\usepackage{algorithmic}
% algorithmic.sty was written by Peter Williams and Rogerio Brito.
% This package provides an algorithmic environment fo describing algorithms.
% You can use the algorithmic environment in-text or within a figure
% environment to provide for a floating algorithm. Do NOT use the algorithm
% floating environment provided by algorithm.sty (by the same authors) or
% algorithm2e.sty (by Christophe Fiorio) as IEEE does not use dedicated
% algorithm float types and packages that provide these will not provide
% correct IEEE style captions. The latest version and documentation of
% algorithmic.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithms/
% There is also a support site at:
% http://algorithms.berlios.de/index.html
% Also of interest may be the (relatively newer and more customizable)
% algorithmicx.sty package by Szasz Janos:
% http://www.ctan.org/tex-archive/macros/latex/contrib/algorithmicx/
% *** ALIGNMENT PACKAGES ***
%
%\usepackage{array}
% Frank Mittelbach's and David Carlisle's array.sty patches and improves
% the standard LaTeX2e array and tabular environments to provide better
% appearance and additional user controls. As the default LaTeX2e table
% generation code is lacking to the point of almost being broken with
% respect to the quality of the end results, all users are strongly
% advised to use an enhanced (at the very least that provided by array.sty)
% set of table tools. array.sty is already installed on most systems. The
% latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/required/tools/
%\usepackage{mdwmath}
%\usepackage{mdwtab}
% Also highly recommended is Mark Wooding's extremely powerful MDW tools,
% especially mdwmath.sty and mdwtab.sty which are used to format equations
% and tables, respectively. The MDWtools set is already installed on most
% LaTeX systems. The lastest version and documentation is available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/mdwtools/
% IEEEtran contains the IEEEeqnarray family of commands that can be used to
% generate multiline equations as well as matrices, tables, etc., of high
% quality.
%\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/eqparbox/
% *** SUBFIGURE PACKAGES ***
%\usepackage[tight,footnotesize]{subfigure}
% subfigure.sty was written by Steven Douglas Cochran. This package makes it
% easy to put subfigures in your figures. e.g., "Figure 1a and 1b". For IEEE
% work, it is a good idea to load it with the tight package option to reduce
% the amount of white space around the subfigures. subfigure.sty is already
% installed on most LaTeX systems. The latest version and documentation can
% be obtained at:
% http://www.ctan.org/tex-archive/obsolete/macros/latex/contrib/subfigure/
% subfigure.sty has been superceeded by subfig.sty.
\usepackage[caption=false]{caption}
\usepackage[font=footnotesize]{subfig}
% subfig.sty, also written by Steven Douglas Cochran, is the modern
% replacement for subfigure.sty. However, subfig.sty requires and
% automatically loads Axel Sommerfeldt's caption.sty which will override
% IEEEtran.cls handling of captions and this will result in nonIEEE style
% figure/table captions. To prevent this problem, be sure and preload
% caption.sty with its "caption=false" package option. This is will preserve
% IEEEtran.cls handing of captions. Version 1.3 (2005/06/28) and later
% (recommended due to many improvements over 1.2) of subfig.sty supports
% the caption=false option directly:
%\usepackage[caption=false,font=footnotesize]{subfig}
%
% The latest version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/subfig/
% The latest version and documentation of caption.sty can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/caption/
% *** FLOAT PACKAGES ***
%
%\usepackage{fixltx2e}
% fixltx2e, the successor to the earlier fix2col.sty, was written by
% Frank Mittelbach and David Carlisle. This package corrects a few problems
% in the LaTeX2e kernel, the most notable of which is that in current
% LaTeX2e releases, the ordering of single and double column floats is not
% guaranteed to be preserved. Thus, an unpatched LaTeX2e can allow a
% single column figure to be placed prior to an earlier double column
% figure. The latest version and documentation can be found at:
% http://www.ctan.org/tex-archive/macros/latex/base/
%\usepackage{stfloats}
% stfloats.sty was written by Sigitas Tolusis. This package gives LaTeX2e
% the ability to do double column floats at the bottom of the page as well
% as the top. (e.g., "\begin{figure*}[!b]" is not normally possible in
% LaTeX2e). It also provides a command:
%\fnbelowfloat
% to enable the placement of footnotes below bottom floats (the standard
% LaTeX2e kernel puts them above bottom floats). This is an invasive package
% which rewrites many portions of the LaTeX2e float routines. It may not work
% with other packages that modify the LaTeX2e float routines. The latest
% version and documentation can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/sttools/
% Documentation is contained in the stfloats.sty comments as well as in the
% presfull.pdf file. Do not use the stfloats baselinefloat ability as IEEE
% does not allow \baselineskip to stretch. Authors submitting work to the
% IEEE should note that IEEE rarely uses double column equations and
% that authors should try to avoid such use. Do not be tempted to use the
% cuted.sty or midfloat.sty packages (also by Sigitas Tolusis) as IEEE does
% not format its papers in such ways.
% *** PDF, URL AND HYPERLINK PACKAGES ***
%
\usepackage{url}
\usepackage{color}
% url.sty was written by Donald Arseneau. It provides better support for
% handling and breaking URLs. url.sty is already installed on most LaTeX
% systems. The latest version can be obtained at:
% http://www.ctan.org/tex-archive/macros/latex/contrib/misc/
% Read the url.sty source comments for usage information. Basically,
% \url{my_url_here}.
% *** Do not adjust lengths that control margins, column widths, etc. ***
% *** Do not use packages that alter fonts (such as pslatex). ***
% There should be no need to do such things with IEEEtran.cls V1.6 and later.
% (Unless specifically asked to do so by the journal or conference you plan
% to submit to, of course. )
% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor}
%!!! Ha elfelejteném, tedd vissza a \figs-t.
%\newcommand{\figs}{{/home/zoltan/phd/arpi/networks/cn-edu}}
\newcommand{\figs}{{./eps}}
\newcommand{\deb}{{\em deb}}
\newcommand{\apt}{{\tt apt}}
\newcommand{\kin}{$k_{\rm in}$}
\newcommand{\kout}{$k_{\rm out}$}
\newcommand{\htmladdnormallink}[2]{{#1}}
\begin{document}
\title{Complex networks in the higher education}
\author{
\IEEEauthorblockN{\'Arp\'ad Horv\'ath}
\IEEEauthorblockA{Budapest Tech,
Center for Innovation and Education\\
H-8000 Sz\'ekesfeh\'erv\'ar, Budai \'ut 45., Hungary \\
Email: horvath.arpad@roik.bmf.hu}
\IEEEauthorblockA{and}
\IEEEauthorblockN{Zolt\'an Tr\'ocs\'anyi}
\IEEEauthorblockA{University of Debrecen and Institute of Nuclear
Research of the Hungarian Academy of Sciences\\
H-4001 Debrecen P.O.Box 51, Hungary\\
Email: Z.Trocsanyi@atomki.hu}
}
\maketitle
\definecolor{DarkRed}{rgb}{0.55,0.0,0.0}
\definecolor{LightRed}{rgb}{0.9,0.8,0.8}
\lstset{language=python, extendedchars=true, basicstyle=\small,
keywordstyle=\color{DarkRed}\bfseries,
%backgroundcolor=\color{LightRed}
frame=single
} %, numbers=left, numberstyle=\normalsize, stepnumber=5, numbersep=5pt, firstnumber=10}
\begin{abstract}
%\boldmath
Recent investigations in the field of complex networks include the
analysis of distributions of connections in particular networks and the
clustering properties of networks. In the first part of my presentation
I analyse a network: the dependency graph of the software packages
of the Ubuntu distribution of GNU/Linux. It is a directed acyclic
graph, and I analyse the degree distribution of the graph for
in-degrees (connections directed into a node, the dependent packages),
out-degrees (connections directed away from a node) and plain-degrees.
I also analyse the clustering properties of the graph.
In the second part, I discuss the teaching of these topics at the
college level. Using the NetworkX modul for the Python programming
language students can easily investigate these properties of the
package dependence graph.
\end{abstract}
\section{Introduction}
Recent investigations of complex networks \cite{Barabasi-2002} resulted in
many interesting observations and discoveries that have potentionally
important applications both in social and life sciences. The analytical
skills required for such investigations do not involve high-level
mathematics, therefore, the new concepts can fairly easily be
transferred to the general education.
Computers are popular among young people and are readily available almost
everywhere. In order to bring the studies of networks into the classroom,
we have choosen a graph, where the nodes are software packages of a
freely available operating system (Linux) and the links are dependences
among them. The level of the required skills for such investigations
are at the college-level, but can be used in special classes in
secondary schools as well.
%I have choosen a graph to perform such a invetigations with, wich can be
%a part of the education. I planned it for college students, but most of
%them can be used on a special course of a secondary school as well. I
%We used free and mostly crossplatform software components. The biggest part
%of the paper is the description of the results we can achieve with this
%tools, always with the code with we can reach it. I do not write about
%the syntax of the Python language, it has \htmladdnormallink{good
%tutorial as a part of its documentation}{http://python.org/doc}. I hope
%the codes are easy to understand.
\section{Dependences of Linux packages and Python analysis software}
%\section{About Linux packages}
There are two well-known package formats for Linux: the \deb~and {\em rpm}
packages, compressed archives of the files of a software compiled for a
given architecture with some metadata. The {\deb}s have been created for
Debian and been used in Linux distributions like Debian \cite{debian} and Ubuntu \cite{ubuntu}. The
\emph{rpm}s have been created by Red Hat Inc. for the Red Hat Linux and
are used in some other distributions including Mandriva, SuSE, Fedora as
well. In this presentation we concentrate on {\deb}s as provided in the
Ubuntu 8.04 distribution.
The \apt~package management tool, used mainly for the \deb~packages,
is a collection of programs to
help install, remove or upgrade packages. The \deb~files are accessed
using optical discs or repositories on the Internet and
describe the dependencies, so -- if we want to install a package --
\apt\ knows which other packages need to be installed in addition.
We can also use the \apt~program
to get the main properties of packages with the {\tt apt-cache} command.
For instance, the \texttt{apt-cache show vim} command gives the following output:\\[.5 ex]
{\small
Package: vim\\
Priority: optional\\
Provides: editor\\
Depends: libc6 ($>$= 2.7-1), libgpmg1 ($>$= 1.19.6-1), libncurses5 ($>$= 5.6+20071006-3), python2.5 ($>$= 2.5), vim-common (= 1:7.1-138+1ubuntu3), vim-runtime (= 1:7.1-138+1ubuntu3)\\
Filename: pool/main/v/vim/vim\_7.1-138+1ubuntu3\_i386.deb\\
Description: Vi IMproved - enhanced vi editor\\
Origin: Ubuntu
} %small
From this we can see, which version of other packages the {\tt vim} package
depends on. All packages have an attribute called priority, which can be
required, important, standard, optional and extra; in this case
optional.
The packages for Ubuntu can be in three
repositories: main, universe and multiverse. In our analyis we include files from all. From the row starting with
Filename we can conclude, that {\tt vim} is in the main repository and was
compiled for Intel 386 architecture.
%In the Depends row not only real packages are allowed, but there may occur virtual packages,
%e.g. editor. In that case the package needs one of the packages that
%provides the editor virtual package, for example vim, emacs or nano.
%\section{The software we use}
Python is an object oriented programming language with an
extended standars library and some useful modules for science \cite{scipy}:
\begin{itemize}
\item NumPy for matrix calculations,
\item Scipy for scientific tasks, for example for optimization and
fitting data,
\item NetworkX for invesigating network (or graph) properties,
\item Matplotlib (and its part pylab), a plotting tool.
\end{itemize}
We use the IPython, one of the interactiv shells for Python.
%With this we can write a Python code and run it from the command prompt.
%like in MATLAB.
It has some very convenient features: for example, running from command prompt, automatic identation,
input and output history and interactive plotting of diagrams and graphs.
We wrote a program called \texttt{packages.py} to manipulate and analyse the data
of the packages. It is based on the {\tt NetworkX} \cite{networkx}, \apt, {\tt apt\_pkg} and {\tt SciPy}
modules. These programs are platform independent
%The programs I mentioned in this section are all platform independent
except the \apt~and {\tt apt\_pkg} modules, that generate the
dependency graph from the Ubuntu package database. On other platforms (e.g.~Windows or MacOS) we can investigate saved graphs.
One can download and analyse saved graphs from the webpage
\url{http://mail.roik.bmf.hu/num/complex_networks/packages}
as \texttt{ubuntu\_packages*.dot}, where * can be nothing or the date of
creation.
%So we can deal with the graphs on Windows or MacOS platforms as well.
The \texttt{packages.py} file can be downloaded from the same website.
\section{Basic properties of the dependency graph}
%I investigated the packages of the Ubuntu distribution of
%GNU/Linux. I used the version 8.04 (Hardy Heron) of Ubuntu.
%I analysed the packages from the repositories main, universe and multiverse.
The dependency graph is directed. If package $A$ depends on $B$, then the
$(A,B)$ edge is member of the graph. One package in the graph has successors
and predecessors, namely the packages it depends on and the packages
that depend on it. In Fig.~\ref{vim_graph} we can see the vim
package with its successors and predecessors.
The first row lists the packages needed by the \texttt{vim} package:
\verb+['vim-vimoutliner', 'vim-latexsuite']+.
\begin{figure}[!t]
\centering
\includegraphics[width=\linewidth]{\figs/vim-neighbors2}
\caption{A small part of the dependency graph which includes the vim
package and the predecessors and successors of vim.}
\label{vim_graph}
\end{figure}
The graph is not acyclic because it has some pairs of packages depending
on each other. As of July 2008, it has 657 such pairs. Among these 559
pairs are language packages, such as\\
language-pack-sr-base $\leftrightarrow$ language-pack-sr\\
language-pack-kde-sr $\leftrightarrow$ language-pack-kde-sr-base\\
openoffice.org-hyphenation-sr $\leftrightarrow$
language-support-writing-sr
\section{Degree distribution}
The simplest model for a complex network is the Erd\H os--R\'enyi model
defined as $N$ labeled nodes connected by $n$ edges, which are chosen
randomly from the set of $N(N-1)/2$ possible edges. Each possible edge is
connected with the probability $p$.
In this model the degree distribution has a peak at the value $pN$.
Real networks -- like the internet and citation networks --
differ from this model in their
degree distribution. Instead of having a peak, these networks show a so
called scale-free distribution characterised by a power-law
function \cite{albert-barabasi-2002},%\cite[p. 49]{albert-barabasi-2002}:
\begin{equation}
P(k)\sim k^{-\gamma}.
\end{equation}
In the case of our dependency graph, we plotted various probability
density distributions ($p(k_{\rm in})$ for in-, $p(k_{\rm out})$ for out- and
$p(k)$ for plain-degree) in Figure \ref{fig:deg_dist}. The probability
distribution $p(k)$ times the bin width, $p(k) (k_{i+1}-k_i)$ for $k
\in [k_i, k_{i+1}]$, gives the probability that the degree of a
randomly chosen connection falls in the range $[k_i, k_{i+1}]$ (of
course, only integer degrees are possible and not all values are
actually realized in a given graph). The blue plus marks show the
distributions with unit size bins, $k_{i+1}-k_i = 1$. At a glance,
the in-degree distribution exhibits a heavy-tailed power-law shape. In
the case of the out-degree distribution there is a kink at \kout
$\simeq 30$, so it is not clear whether the distribution is best
described by a single power-law distribution, or by two distributions
with different coefficients. The same feature can also be observed on
the plain-degree distribution, although it is suppressed by the effect
of in-degrees. To make a decision about the best power-law
description, one should make a decision about the fit-range. However,
at the tail of the diagram (for large $k$) there is a lot of degrees
with no nodes having that degree. Using logarithmic scales on both
axes, the zero values are not visible on the plot. Furthermore, there
are many degrees that occur only once or twice. Thus an arbitrary
truncation of the fit-range results in a large systematic uncertainty
in the fitting of the $\gamma$ power.
There are known ways to get around this problem \cite{newman-2003-45}.
In this analysis we chose to increase the bin size towards the tail of
the distribution (bin smearing) such that if the occurences of
three degree values $k_i$, $k_j$ and $k_l$ are non-zero, but there is
at least one degree with zero occurence between $k_i$ and $k_j$, as
well as between $k_j$ and $k_l$ and no other degree values with
non-zero occurence, then we used the geometric means $k_{j\min} =
\sqrt{k_i k_j}$ and $k_{j\max} = \sqrt{k_j k_l}$ to define the bin
boundaries around the value $k_j$. Furthermore, we set the probability
density value $p(k_j)$ to $p(k_j)/(k_{j\max} - k_{j\min})$. These
values are plotted with red crosses.
We fitted a power-law function to smeared distributions (dashed line).
The exponent is -1.79 for in-degree, -2.88 for out-degree and -2.11 for
the plain-degree. For in-degree and plain-degree the fitting is very
good, the $\chi^2$/d.o.f. is below 1\,\% for both cases, while it is much
larger, about 7\,\% for out-degree. The higher value for the latter can
be understood from the plot: there is a visible kink in the distribution
at \kout$\simeq 30$, so the fitted function differs from the data
for almost all degrees. We have also carried out fits with different
power-law functions for small and large \kout~values, which
resulted in much smaller $\chi^2$/d.o.f. (0.5\,\% for small and 3.3\,\%
for large values) and very different exponents, -1.87 and -4.79 for small
and large values of \kout, respectively.
Similar investigations to ours were carried out in
Ref.~\cite{labelle-2004}, where substaintially smaller exponents were
found. Although, the network in that study was different from ours,
nevertheless, we attribute the difference in the results mainly to the
use of unsmeared bins in \cite{labelle-2004}, which yields high
sensitivity to the fit range as discussed above.
\begin{figure}%
\centering
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/degree_dist_fitted_in}}\\
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/degree_dist_fitted_out}}\\
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/degree_dist_fitted_in_out}}
\caption{Probability distributions for (a) in-degree, (b) out-degree,
(c) plain-degree (sum of in- and out-degrees).
The blue `+' marks correspond to the original, the red `$\times$' to the smeared distribution
(see text).}%
\label{fig:deg_dist}%
\end{figure}
The highest out-degree is 161 (for icthux-desktop), while highest
in-degree is 11114 (for libc6 package). We listed other packages with
large in- or out-degrees in table \ref{tab:degree}.
\begin{table}
\begin{tabularx}{\linewidth}{|r|l|X|}
\hline
\kin &package&description\\
\hline
11113 &libc6 &GNU C Library: Shared libraries\\
3230 &libgcc1 &GCC support library\\
3109 &libstdc++6 &The GNU Standard C++ Library\\
2696 &libx11-6 &X11 client-side library\\
1985 &libglib2.0-0 &The GLib library of C routines\\
1940 &zlib1g &Compression library - runtime\\
1929 &perl &Larry Wall's Practical Extraction and Report Language\\
1865 &libxext6 &X11 miscellaneous extension library\\
1381 &libgtk2.0-0 &The GTK+ graphical user interface library\\
1296 &python &An interactive high-level object-oriented language\\
1279 &libpango1.0-0 &Layout and rendering of internationalized text\\
1226 &libcairo2 &The Cairo 2D vector graphics library\\
1217 &libatk1.0-0 &The ATK accessibility toolkit\\
1177 &dpkg &Package maintenance system for Debian\\
\hline
\kout &package&description\\
\hline
161 &ichthux-desktop &Desktop for Christians\\
120 &ubuntu-desktop &Ubuntu with GNOME desktop environment\\
117 &gobuntu-desktop &\\
113 &ubuntustudio-desktop&\\
103 &texlive-full &Meta package pulling in all components of TeX Live\\
87 &kubuntu-desktop &Ubuntu with KDE desktop environment\\
83 &xubuntu-desktop &Ubuntu with XFCE desktop environment\\
77 &ubuntu-minimal &\\
74 &rhythmbox &Music player for GNOME\\
\hline
\end{tabularx}
\caption{The packages with largest in-degrees (\kin $>1150$) and with
largest out-degrees (\kout $>70$).}
\label{tab:degree}%
\end{table}
\section{Clustering coefficient}
In real networks, like the internet,
neighbouring nodes are more likely connected than other
two nodes chosen randomly.
There is a measure for this property of the graph, called the clustering
coefficient. In an undirected simple graph (a graph without multiple and loop
edges), if a selected node $i$ has $k_i$ neighbours, then between the
$k_i$ neighbours can be no more, than $k_i(k_i-1)/2$ edges. We define
the clustering coefficient $C_i$ of a node, as:
\begin{equation}
C_i=\frac {2E_i}{k_i(k_i-1)},
\end{equation}
where $E_i$ the actual number of edges between the neighbours
\cite{albert-barabasi-2002}.
%\cite[p. 49]{albert-barabasi-2002}. If $C_i=1$,
If $C_i=1$, all pairs of the neighbours are connected.
%
For the whole graph we can define a $C$ clustering coefficient as the
average of the $C_i$-s for all the nodes in the graph.
%
Directed graphs are generally transformed to undirected to calculate
clustering coefficients.
In real-world networks $C$ is much greater
than in the Erd\H{o}s--R\'enyi model, and $C_i$ falls off with $k_i$,
proportional to $k^{-1}$ \cite{ravasz-2003-67}.
\begin{figure}%
\centering
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/cc_degree_diagram_required}}\\
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/cc_degree_diagram_required+important}}\\
\subfloat[][]{\includegraphics[width=\linewidth]{\figs/cc_degree_diagram_all}}
\caption{The average of clustering coefficient as a function of
degree. (a) for the packages with required priority ($\gamma=0.888$), (b)
for the packages both from required and important priority ($\gamma=0.967$),
(c) for all the packages we investigated ($\gamma=0.860$).}%
\label{fig:cc_deg}%
\end{figure}
We investigated the clustering coefficient in the whole graph and its two
subgraphs. The result is shown in Figure (\ref{fig:cc_deg}).
The first subgraph of 69 nodes contained packages with ``required'' priority
(a). Only the libc6 packages had 54 neighbours, the others had less then
9. The second subgraph contained all 161 packages with ``required'' or
``important'' priority (b). Here the package libc6 had 124 neighbours,
ubuntu-minimal had 77 neighbours and all of the others had less than 17
neighbours. The clustering coefficient as a function of degree in those
two cases are close to a power law function with $0.8<\gamma<1$. For the
last graph, which contained all of the investigated packages (c), for
small degrees the exponent is much smaller then for high degrees, sharing
this property with some other real networks \cite{ravasz-2003-67}.
%\cite[Figure 3]{ravasz-2003-67}.
\section{Distance between packages}
\label{near}
In the previous section, we considered packages being neighbouring
if there is a connection between them. We can however, use another
measure for defining the nodes to be neighbours.
For each package $p_i$, we collect its predecessors
into a set $s_i$. For each pair of packages $p_i$ and $p_j$, we mark
the the cardinality of the union of the sets $s_i$ and $s_j$ with
$|U_{ij}|$ and that of their intersection with $|I_{ij}|$. Then we
define a distance measure $d_{ij}$ of the two packages as
\begin{equation}
d_{ij}=\frac{|I_{ij}|}{|U_{ij}|}.
\end{equation}
If two packages have the same predecessors $d_{ij}=1$, if the sets of
predecessors are disjoint $d_{ij}=0$. We have searched for packages having
many predecessors, with large $d_{ij}$ (Table \ref{tab:dij}).
\begin{table}
\begin{tabularx}{\linewidth}{|X|X|r|r|}
\hline
Package 1 & package 2 & $N_{pmax}$ & $d_{ij}$ \\
\hline
\hline
libatk1.0-0 & libpango1.0-0 & 1280 & 0.917 \\
\hline
libcairo2 & libpango1.0-0 & 1280 & 0.904 \\
\hline
libatk1.0-0 & libcairo2 & 1226 & 0.907 \\
\hline
libice6 & libsm6 & 1134 & 0.990 \\
\hline
libxrandr2 & libxcursor1 & 906 & 0.903 \\
\hline
libgl1-mesa-glx & libgl1 & 363 & 0.922 \\
\hline
libxcomposite1 & libxdamage1 & 332 & 0.938 \\
\hline
libqt4-core & libqt4-gui & 310 & 0.916 \\
\hline
libglu1-mesa & libglu1 & 287 & 0.982 \\
\hline
kde-icons-oxygen & kdebase-runtime & 174 & 0.982 \\
\hline
kde-icons-oxygen & kdebase-runtime-data & 173 & 1.0 \\
\hline
kdebase-runtime & kdebase-runtime-data & 174 & 0.982 \\
\hline
libesd0 & libesd-alsa0 & 108 & 0.990 \\
\hline
gnustep-gpbs & gnustep-back0.12 & 59 & 0.983 \\
\hline
gnustep-gpbs & gnustep-gui-runtime & 62 & 0.935 \\
\hline
gnustep-back0.12 & libgnustep-gui0.12 & 65 & 0.907 \\
\hline
gnustep-back0.12 & gnustep-gui-runtime & 62 & 0.920 \\
\hline
libgnustep-gui0.12 & gnustep-gui-runtime & 65 & 0.953 \\
\hline
libgnustep-base1.14 & gnustep-base-runtime & 70 & 0.971 \\
\hline
roxen2 & roxen & 66 & 0.924 \\
\hline
libblas3gf & libblas.so.3gf & 72 & 0.958 \\
\hline
liblapack3gf & liblapack.so.3gf & 61 & 0.950 \\
\hline
\end{tabularx}
\caption{The package pairs with $d_{ij}>0.9$ and $N_{pmax}>50$. ($N_{pmax}$ is
the maximum of the number of predecessors for the two packages.)
}
\label{tab:dij}%
\end{table}
In the first three rows of the table there is a triple of packages
sharing almost all of their predecessors. These are very close together in
this sense, however only the packages libpango and libcairo are connected
directly (libatk is not connected).
\begin{figure}%
\includegraphics[width=\linewidth]{\figs/gnustep_circo}
\caption{The subgraph of the six gnustep packages.}%
\label{fig:gnustep}%
\end{figure}
There is another six rows for packages in connection with gnustep. These
share many common predecessors as well. Five of them are connected with
edges, but one (gnustep-back0.12) has no connection with them
(Figure \ref{fig:gnustep}).
In studying the clustering coefficient, we considered nodes that were
directly connected.
Introducing the distance $d_{ij}$ creates a new possibility to consider
nodes to be neighours
and calculate the clustering coefficient accordingly.
We did not yet pursue this study further.
\section{Bringing the investigation of networks into the classroom}
In Section~II. we listed the software tools that we used for analysing
the properties of the dependency graph. These tools can be used
sufficiently easily, so that classrooms excercises can be devised for
introducing the properties of complex networks.
%TODO How to install the software $\rightarrow$ Write
The installation of the necessary software is described on the home page
(presently in Hungarian)
mail.roik.bmf.hu/num/complex\_networks/doc/install\_hu.html.
In the following we assume Ubuntu platform, the necessary software is
installed, the \texttt{packages} directory is the current directory, we started the
IPython interpreter with the \texttt{pylab} option (\texttt{ipython -pylab}) and typed the rows below (Python
keywords are {\color{DarkRed}\bfseries highlighted}):
\begin{lstlisting}
import packages
import networkx
\end{lstlisting}
%\section{Basic actions with the graph}
%We have three possibilities to get a package dependency graph.
If the \apt~and {\tt apt\_get} modules are installed, we can get the
package dependency graph from the cache of the \apt~program.
%These two modules are quite easy to install
%on Linux systems, which use deb packages (Ubuntu, Debian).
The commands
\begin{lstlisting}
G=packages.get_graph()
print G.info()
\end{lstlisting}
result in the following list of information:
%Then we get the graph and have some information of main properties:
{\small
\begin{verbatim}
Name: Ubuntu package dependency...
Type: MyDiGraph
Number of nodes: 25536
Number of edges: 121316
Average degree: 9.5016
\end{verbatim}
} %small
On platforms that do not have the \apt~package, we can get a graph from the saved \texttt{ubuntu-packages.dot} file skipping the previous step (a text
file which uses the \htmladdnormallink{DOT
language}{http://www.graphviz.org/Documentation.php}) as well:
\begin{lstlisting}
G=packages.get_graph('ubuntu-packages.dot')
\end{lstlisting}
%Alternatively, we can also get the graph from a shelve file of Python.
%EZT NEM ERTEM.
We can archive a graph with the \texttt{save\_graph} method of the
object:
\begin{lstlisting}
G.save_graph('2008-07-01.dot')
\end{lstlisting}
We can select some of the nodes. For instance, the command
\begin{lstlisting}
[node for node in G if 'vim' in node]
\end{lstlisting}
produces the following list (the original list is longer):
{\small
\begin{verbatim}
['vim-vimoutliner', 'vim-latexsuite',
'vim', 'vim-common', 'vim-full', 'vim-gtk']
\end{verbatim}
} %small
To identify the predecessors and successors of a certain package, for
instance {\tt vim}, we can use the following NetworkX commands:
\begin{lstlisting}
G.predecessors('vim')
G.successors('vim')
\end{lstlisting}
which was used in creating the graph in Fig.~1.
We can also use the program to draw the diagram of the degree
distribution for the package dependency graph. For instance, the code
\begin{lstlisting}
degrees = G. degree()
packages.degree_distribution(degrees,
direction = 'all',
savefig_file = 'deg_dist.png')
\end{lstlisting}
was used to obtain the graphs in Fig.~2.
The direction parameter in the last function may be set to 'in' or 'out' depending whether we want to
count only the number of predecessors or successors respectively,
`None' if we wish to analyse the sum of them, or 'all' if we want each these
diagrams. If the \texttt{savefig\_file} is
given, the function saves the plot to a file with the given name; eps, jpeg, pdf, png, ps
and svg file extensions are allowed. If the direction is `all', then the directions
will be included in the file names, so none of the versions will be overwritten.
For example, it saves the file \texttt{deg\_dist\_in.png} for in-degree.
To find the node with the largest in-degree (see end of Sect.~IV), we
use the commands
\begin{lstlisting}
G.direction = 'in'
G.largest_degrees()
\end{lstlisting}
\section{Conclusions and Outlook}
In this presentation we discussed some properties of the directed
graphs of dependency packages of a freely available operating system, the
Ubuntu 8.04 distribution of Linux. We have shown that the graphs are
scale-free, i.e.~the degree distributions are power-law like with
exponent -1.79 for in-degree, -2.11 for plain-degree. The out-degree
distribution has a kink, exhibiting power-law distributions with
exponent -1.87 below $k\simeq 30$ and -4.79 above. If we ignore this kink
and fit a single power-law function then we find -2.88 as exponent.
We also found the clustering coefficients $C_i$ of the graph and its
certain subgraphs, and found that $C_i$ falls off with $k_i$
proportional to $k^{-\beta}$, with $\beta$ being close to one, which
indicates that these networks are close to being hierarchical.
We also introduced a set of software tools that can easily be employed in the
classroom to teach the basic concepts used in studying the properties of
complex networks. In addition to the properties of complex networks
discussed here, these tools offer further options for investigations
in the classroom. In conclusion, we mention three possibilities. These
investigations are in progress.
\subsection{The small-world property of complex networks}
NetworkX has a function to get the diameter (the maximum length of the
shortest path in the graph) of an undirectied graph. Our graph is
directed and not connected. Therefore, we need to convert the graph
into undirected and also we have to choose the largest connected
component with 23760 nodes (as of August 2008). The following piece of
code does that:
\begin{lstlisting}
Gu = G.to_undirected()
networkx.is_connected(Gu) # I got False
cc = packages.connected_components()
cc0 = cc0
len(cc0) # answer: 23760
networkx.diameter()
# answer (after long time): 12
# How many nodes are in
# the 10 biggest connected components?
for i in cc[:10]:
print len(i),
# answer: 23760 16 12 11 11 8 5 5 5 5
\end{lstlisting}
For such a big network the running time of this code is quite long.
\subsection{Some generalized random networks and the hierarchical model}
We mentioned the Erd\H os--R\'enyi model and in the table
\ref{tab:models} we list another three worth to mention. These have
detailed description in the papers found in the references.
\begin{table}
\begin{tabularx}{\linewidth}{|X|*{4}{c|}}
\hline
Model & SW & PDD & CC& Ref.\\
\hline
Erd\H os--R\'enyi & no &no& no&\cite{albert-barabasi-2002}\cite{newman-2003-45}\\
Watts--Strogatz & yes & no & no&\cite{albert-barabasi-2002}\cite{newman-2003-45}\\
Albert--Barab\'asi&yes&yes&no&\cite{albert-barabasi-2002}\cite{newman-2003-45}\\
Hierarchical&yes&yes&yes&\cite{ravasz-2003-67}\\
\hline
\end{tabularx}
\caption{Properties explained by the models (SW=small-world
property, PDD=power-law degree distribution, CC=clustering coefficient
as a function of degree)}
\label{tab:models}%
\end{table}
\subsection{Connections with informatics}
Somtimes running programs requires very large CPU time. These days
CPU-s with more cores are quite common. In this case, the programmer can
split the program into parts. For example, having a dual core CPU, we were
able to halve the running time in searching for the common predecessors
(Section \ref{near}). To get the distances of nodes, the program needed to
investigate all (more then 320 million) pairs of packages. With running
the program twice in parallel with disjoint set of node pairs, and
storing the results, we were able to obtain the whole list of pairs
with distances $d_{ij}>0.2$.
%There are more complex cases and graphs with more nodes when we need
%parallel programming and distributed systems, such as GRID. Which can be
%covered in the education of computer scientists.
%\subsection{Some exercises for programmers}
%For students, has experience in programming, it is highly recommended to
%give more complex exercises we covered here.
%TODO
%Getting the differences between the states of the graph in several time.
%(Compare to evolution models)
%How change the properties of the graph the inclusion of the high $p$
%edges.
%Is there any cycle in the graph except of the pairs of edges between two
%nodes.
%Investigate the robustness against failures and attacks.
%xxx Deals with provides. (Multiedges?)
%Paralellizing the code where it is too slow.
%Make the steps of the hierarchical model with program.
%xxx algoritm for the tail of the fitting
%\cite{cancho-2001}
%\cite{ravasz-2003-67}
%\cite{newman-2003-45}
% can use a bibliography generated by BibTeX as a .bbl file
% BibTeX documentatio can be easily obtained at:
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
%\bibliography{IEEEabrv,../bib/paper}
\bibliography{network}
\end{document}