1 Introduction

In recent years, bibliometric analysis of scientific texts has become a topic of research interest in its own right. Although it was realized early on that scientific productivity could indeed be scrutinised by scientific methods [1, 2], it is only with the recent advent of online bibliographies and text repositories, with the enormous gain in computational power afforded by new generations of computing machines, and with the improved sophistication of algorithms, that the field has become accessible for in-depth analysis.

Based on the idea that science is performed as a social activity, the notion of social networks [3] has played an ever increasing role in the analysis of developments in science [4, 5]. Co-authorship networks are particularly interesting, as they reveal patterns of activity that relate both to single individuals and to the development of an entire field.

There are various tools for analysis, among them statistical analysis, visualisation, eigen analysis, and temporal analysis of these networks. We shall concentrate on eigen analysis.

Expressed as matrices, coauthorship networks are amenable to quantitative algebraic analysis. Eigen analysis can be thought of as similar to the way that the mean, variance, skewness, kurtosis and higher order cumulants can be used to summarise a distribution. Given enough higher order cumulants the complete distribution can be recovered exactly, but the lower order cumulants (mean, variance, ...) are often used to summarise it. In the same way, the complete set of eigenvectors and their eigenvalues fully describes a matrix, but the principal modes of the matrix are summarised by the eigenvectors with the largest (absolute value) eigenvalues. In face recognition, for example, the first eigenface captures the similarity between people’s faces, and higher order eigenfaces describe in increasing detail the variation between faces [6, 7]. Although linear algebra is a classic technique, there have been interesting new uses for eigenvectors and their associated eigenvalues. Even the career choices of medical students [8] have been subjected to this kind of examination. In his wide-ranging analysis of co-citations of 50 leading medical informaticians [9], Andrews includes eigenvectors to capture the variability in his data.

In contrast to that study of co-citations among a small number of established experts, we will use eigenvalue analysis of coauthorship on the whole of the genetic programming field as captured by the GP Bibliography, http://www.cs.bham.ac.uk/~wbl/biblio/. We use the same version of the GP bibliography as was recently analysed by Tomassini and colleagues [10].

2 Co-authors in the GP bibliography

As of April 2006, the bibliography contained information on 3,078 GP authors and 4,128 papers, books, proceedings, technical reports, etc. (See [11] for a recent study of the GP literature.) We concentrate upon the relationships between authors held in the bibliography. In particular, we look at the 11,005 links between authors due to two or more people collaborating on writing (or editing) an entry in the bibliography. These links form a co-authorship graph.
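The way such links arise from bibliography entries can be sketched as follows. This is a minimal illustration, assuming each entry is available as a plain list of author names; the function name `coauthor_links` and the toy data are hypothetical and not taken from the GP bibliography. Whether the link count quoted above refers to distinct pairs or to some other tally is not stated, so the sketch simply counts shared entries per unordered pair.

```python
from collections import Counter
from itertools import combinations


def coauthor_links(entries):
    """Count joint entries for every unordered pair of authors.

    `entries` is a list of author lists, one per bibliography entry.
    Returns a Counter mapping (author_a, author_b), with author_a < author_b,
    to the number of entries the pair share.
    """
    links = Counter()
    for authors in entries:
        # sorted(set(...)) drops duplicate names and fixes the order in each pair
        for a, b in combinations(sorted(set(authors)), 2):
            links[(a, b)] += 1
    return links


# Hypothetical toy data (assumption, for illustration only)
toy_entries = [["A", "B"], ["A", "B", "C"], ["D"], ["C", "E"]]
toy_links = coauthor_links(toy_entries)
```

Single-author entries, such as the one by "D" here, contribute no links, which is one reason the resulting graph can fall into disconnected parts.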

As with other fields, the GP field is not fully interconnected by joint publications. That is, the whole co-authorship graph is not a “small world” network, although Tomassini et al. [10] show that in important respects the connected component behaves like a small world network. In restricted domains, like literature studies, it is common for the graph to fall into disconnected parts. This is because the social interactions we look at, joint publications, are sparsely distributed. That is, no one has published papers with more than a tiny fraction of the other people in their field. This is certainly true of genetic programming. However, the data are both available and quantitative. In contrast, the criterion used in the original small world study [12] was “to be on a first name basis”, which is less tangible, though there are many such links. Having many links virtually guarantees the graph is connected. A connected graph means the giant component encompasses everyone.

Although it would be possible, with care, to conduct an eigen analysis of the complete coauthorship graph, we shall concentrate on the largest connected component, since it is of the most general interest. This component contains 942 people who are together responsible for 2,144 entries (see Fig. 1). We used Graphviz’ neato to lay out the graph. Neato uses the Kamada-Kawai algorithm to search for the layout which least stretches or compresses the links between nodes. To reduce clutter, only links between the first author and the others are shown; however, neato treats all links as if they were the same length.
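Extracting the largest connected component from a set of co-author pairs can be sketched with a breadth-first search. This is an illustrative sketch only: the function name `largest_component` and the toy data are assumptions, and the text does not say which tool was actually used for this step.

```python
from collections import defaultdict, deque


def largest_component(pairs, authors):
    """Return the set of authors in the largest connected component."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    best = set()
    for start in authors:
        if start in seen:
            continue
        # Breadth-first search collects everyone reachable from `start`
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy graph: one group of four, one pair, one isolated author
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "E"), ("F", "G")]
toy_authors = ["A", "B", "C", "D", "E", "F", "G"]
giant = largest_component(toy_pairs, toy_authors)
```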

Fig. 1 Nine hundred and forty-two people who have written one or more GP entries together. The area of circles is proportional to total number of entries in the GP bibliography. Lines indicate coauthorships. The largest components of the first 43 eigenvectors are coloured (using the same colours as in Table 1)

We construct a 942 × 942 symmetric integer matrix C, where \(C_{ij} = C_{ji}\) is the number of GP entries linking author i with author j and \(C_{ii}\) is the total number of GP entries by author i. As expected, even in the largest connected component, most GP authors have not collaborated with most others; hence the elements of C are mostly zero (see Figs. 2 and 3).
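The definition of C can be illustrated with a small sketch. It assumes entries are given as lists of author names and that NumPy is available; the function name `coauthor_matrix` and the toy data are hypothetical.

```python
import numpy as np


def coauthor_matrix(entries, authors):
    """Build the symmetric integer matrix C for the given authors.

    C[i, j] (i != j) counts entries shared by authors i and j;
    C[i, i] counts all entries by author i.
    """
    index = {name: k for k, name in enumerate(authors)}
    C = np.zeros((len(authors), len(authors)), dtype=int)
    for entry in entries:
        # Keep only authors we are tracking, each counted once per entry
        members = sorted({index[a] for a in entry if a in index})
        for i in members:
            for j in members:
                C[i, j] += 1  # the i == j case updates the diagonal
    return C


# Hypothetical toy data (assumption, for illustration only)
toy_entries = [["A", "B"], ["A", "B", "C"], ["C", "E"], ["A"]]
toy_authors = ["A", "B", "C", "E"]
C_toy = coauthor_matrix(toy_entries, toy_authors)
```

In this toy case author "A" has three entries, two of them shared with "B", so `C_toy[0, 0]` is 3 and `C_toy[0, 1]` is 2, while "A" and "E" never collaborate and their element stays zero.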

Fig. 2 Connectivity of giant component of the GP-bib connectivity matrix C. Squares on the diagonal are cliques formed by entries with a large number of coauthors, e.g. GECCO 2005 and GECCO 2003

Fig. 3 Number of coauthored/coedited GP papers in the giant connected component of the GP-bib by coauthor pairings, cf. Fig. 2. Prolific pairings are annotated with the coauthors’ initials. To avoid clutter, only collaborations are shown, i.e. the diagonal and the lower part of C are not plotted

Since C is real and symmetric, its eigenvalues and eigenvectors are also real. Positive eigenvalues correspond to clusters in the graph (non-zero blocks near the matrix diagonal) whilst negative eigenvalues correspond to partitioning the graph (off-diagonal blocks). The 942 eigenvalues of C are plotted in decreasing order in Fig. 4, that is, with the smallest eigenvalue on the right. Of these, 223 are within rounding error of zero and a further 264 are between zero and 1.0. That is, about half (487) of the eigenvalues are 1.0 or less. Figure 4 (solid line) shows that the size of the remaining eigenvalues grows rapidly. Indeed, the dashed line suggests a power law relationship between size and rank.
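A decomposition of this kind can be sketched with NumPy's `numpy.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues with orthonormal eigenvectors. The helper name `eigen_summary`, the tolerance `tol` used for "within rounding error of zero", and the toy matrix are assumptions made for illustration; the text does not state which software or threshold was used.

```python
import numpy as np


def eigen_summary(C, tol=1e-9):
    """Eigen-decompose a real symmetric matrix, largest |eigenvalue| first.

    Returns the eigenvalues, the matching eigenvectors (as columns) and
    the number of eigenvalues whose magnitude is at most `tol`.
    """
    # eigh returns eigenvalues in ascending order; reorder by magnitude
    values, vectors = np.linalg.eigh(C)
    order = np.argsort(-np.abs(values))
    values, vectors = values[order], vectors[:, order]
    n_zero = int(np.sum(np.abs(values) <= tol))
    return values, vectors, n_zero


# Hypothetical toy matrix: two authors with identical rows give a zero eigenvalue
C_toy = np.array([[2, 2, 0],
                  [2, 2, 0],
                  [0, 0, 1]])
values, vectors, n_zero = eigen_summary(C_toy)
```

The two authors with identical rows are indistinguishable in this toy matrix, which is what produces the zero eigenvalue.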

Fig. 4 Nine hundred and forty-two eigenvalues of the GP-bib connectivity matrix C

Note the eigenvectors form an orthonormal basis and, together with their eigenvalues, make it possible to reconstruct the original matrix exactly. The absolute magnitude of each eigenvalue gives the relative importance of its eigenvector. That is, it is possible to approximate the original matrix by ignoring eigenvectors whose eigenvalue is near zero. Another way of looking at this is to say that the large eigenvalue/eigenvector pairs provide a lossy way to compress the matrix or capture its essence. As more eigenvectors are used the reconstruction becomes more accurate. Each added component can adjust matrix values either up or down: positive eigenvalue/eigenvector pairs increase matrix values whilst negative ones reduce them. We will come across reductions in Sect. 3.
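In symbols, this is the standard spectral decomposition of a real symmetric matrix. The notation \(\lambda_k\), \(\mathbf{v}_k\) and \(\hat{C}_K\) is introduced here for illustration only, with the eigenvalues ordered so that \(|\lambda_1| \ge |\lambda_2| \ge \dots \ge |\lambda_{942}|\):

\[
C = \sum_{k=1}^{942} \lambda_k \, \mathbf{v}_k \mathbf{v}_k^{\mathsf{T}},
\qquad
\hat{C}_K = \sum_{k=1}^{K} \lambda_k \, \mathbf{v}_k \mathbf{v}_k^{\mathsf{T}},
\qquad
\mathbf{v}_j^{\mathsf{T}} \mathbf{v}_k =
\begin{cases} 1 & j = k \\ 0 & j \ne k \end{cases}
\]

Here \(\hat{C}_K\) is the lossy approximation obtained by keeping only the K components with the largest absolute eigenvalues; terms with \(\lambda_k\) near zero contribute almost nothing and can be dropped.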

Figure 4 shows that the size of the eigenvalues falls rapidly with rank. We imposed a cut off on the distribution and chose to study in detail the largest 43 eigenvalues and their eigenvectors (i.e. leaving out the smaller 900 or so eigen components). Figure 5 shows that the error in the matrix reconstructed from its eigen components falls exponentially (dotted line) as more components are used. With 43 eigenvectors 70% of the graph can be reconstructed, and a 100% accurate reconstruction needs only 602 eigenvectors.
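How reconstruction error shrinks as components are added can be sketched as below. The error measure behind Fig. 5 is not specified in the text, so this sketch assumes the relative Frobenius norm of the difference; the function name `reconstruction_errors` and the toy matrix are likewise hypothetical.

```python
import numpy as np


def reconstruction_errors(C):
    """Relative Frobenius error of the rank-k reconstruction, k = 0..n.

    Components are added in order of decreasing |eigenvalue|.
    The error measure is an assumption, not taken from the paper.
    """
    C = np.asarray(C, dtype=float)
    values, vectors = np.linalg.eigh(C)
    order = np.argsort(-np.abs(values))       # largest |eigenvalue| first
    values, vectors = values[order], vectors[:, order]

    total = np.linalg.norm(C)                 # Frobenius norm for a 2-D array
    approx = np.zeros_like(C)
    errors = [1.0]                            # k = 0: nothing reconstructed
    for lam, v in zip(values, vectors.T):
        approx += lam * np.outer(v, v)        # add one eigen component
        errors.append(np.linalg.norm(C - approx) / total)
    return np.array(errors)


# Hypothetical toy matrix (assumption, for illustration only)
C_toy = np.array([[3, 2, 1, 0],
                  [2, 2, 1, 0],
                  [1, 1, 2, 1],
                  [0, 0, 1, 1]])
errors = reconstruction_errors(C_toy)
```

Under this measure the error never increases as components are added, and it reaches zero (up to rounding) once every component with a non-zero eigenvalue has been included.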

Fig. 5 Accuracy of reconstructed giant component of GP coauthor graph

Remember the 942 eigenvectors form an orthonormal set. That is, each is at right angles to the others by construction. Therefore, the elements of each eigenvector can be either positive or negative. For convenience they are normalised to be of unit length. For each of the 43 vectors we consider its largest (absolute) components, taking just enough of them to account for 90% of the vector’s length. The number of components needed depends upon the vector and varies from 1 to 209 (from 2 to 60 for the eigenvectors corresponding to the 43 largest eigenvalues). These are listed in Table 1.
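Selecting the components that account for 90% of a vector's length can be sketched as follows. The sketch assumes "length" means Euclidean length, so that the chosen elements alone must form a vector at least 0.9 times as long as the full one; the text does not spell this out, and the function name `components_for_length` and the toy vector are hypothetical.

```python
import numpy as np


def components_for_length(v, fraction=0.9):
    """Indices of the fewest elements of v whose combined Euclidean length
    reaches `fraction` of the length of v, largest |element| first."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(-np.abs(v), kind="stable")
    # Length of the partial vector after including 1, 2, ... elements
    partial = np.sqrt(np.cumsum(v[order] ** 2))
    target = fraction * np.linalg.norm(v)
    # First position where the partial length reaches the target
    count = min(int(np.searchsorted(partial, target)) + 1, v.size)
    return order[:count]


# Hypothetical unit-length vector dominated by two elements of opposite sign
toy_vector = np.array([0.1, -0.7, 0.7, 0.1])
picked = components_for_length(toy_vector)
```

For this toy vector the two elements of magnitude 0.7 already give a partial length of about 0.99, so only those two are selected.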

Table 1 The largest 43 of the 942 eigenvalues of the largest connected component of the GP coauthorship graph. Each eigenvalue is followed by the smallest number of components of the corresponding eigenvector which account for 90% of the vector’s length, followed by those components. Initials are used to save space

Since Table 1 holds more than 90% of the data from the eigenvectors with the largest eigenvalues, we can say it captures the essential data from the connected component of the genetic programming bibliography. Of course this does not mean we could reconstruct the bibliography, or even the coauthor graph, from it. It does not contain fine enough detail for that. However, it does provide a summary of the coauthor relationships within genetic programming.

3 Analysis of eigenvectors

Table 1 contains eigenvectors sorted by order of their eigenvalue (first column). Each vector contains 942 elements, one per author. In Table 1 these are sorted by their absolute magnitude and only the name of the element and the values of the larger elements are given. Column 2 gives the number of elements displayed which, as described above, is the smallest number of elements needed to convey 90% of the eigenvector.

The first two eigenvectors are easy to interpret, cf. Fig. 6. All their large elements are positive and there are few of them. They summarise parts of the coauthor graph whose nodes (authors) do indeed collaborate and have many joint publications. However these provide only a summary of the interactions in both groups, as over time collaborative links might have been both strengthened and weakened. A single eigenvector cannot capture all of this information. Instead, as we shall see later, other eigenvectors strengthen and weaken links established earlier, so that the collaborations are represented with greater precision.

Fig. 6 Picture of first (left), second (center) and third (right) eigen components. To reduce clutter, links with fewer than ten papers are not shown. Yellow indicates reduction in link strength

If we start at the top of Table 1 we can reconstruct our field in increasing detail. Each time we consider a new eigenvalue/eigenvector pair the resolution is improved.

The third eigenvector (Fig. 6) has four positive elements, corresponding to four collaborators but also a negative element (RP −0.17). The negative element does not mean that this author does not collaborate with the others. When we add the third eigenvector, we add four nodes (and up to 28 new links) but by the time of the third eigenvector there is overlap with the two larger eigen components. The negative value indicates the third eigenvector represents a tight group of four collaborators and that the collaborative links with one of the existing nodes in the reconstructed network must be weakened.
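The effect of a negative element can be seen in a small numerical sketch. Adding one eigen component changes the reconstructed link between authors i and j by the eigenvalue times the product of their two eigenvector elements, so elements of opposite sign reduce a link when the eigenvalue is positive. All numbers below are invented for illustration and are not taken from Table 1.

```python
import numpy as np

# Hypothetical eigen component: four collaborators with positive elements
# and one existing author with a negative element (illustrative values only).
lam = 50.0                                   # assumed positive eigenvalue
v = np.array([0.5, 0.5, 0.5, 0.4, -0.2])
v = v / np.linalg.norm(v)                    # normalise to unit length

# Change this single component makes to the reconstructed matrix
delta = lam * np.outer(v, v)

within_group = delta[0, 1]   # both elements positive: link is strengthened
to_negative = delta[0, 4]    # opposite signs: link is weakened
```

In this sketch the links among the four positive authors all increase, while every link between them and the fifth author decreases, which matches the reading that a negative element signals an adjustment to an existing link rather than an absence of collaboration.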

The fourth eigen component is an example where most of the eigenvector’s large components are negative. However, there is one element (RP 0.13) with a large positive value, indicating that the overlap between this group of co-authors and the existing graph must be reduced for this element, rather than increased.

Most of the large elements of the fifth eigenvector are positive whilst the largest is negative. Perhaps this is best thought of not as the collaborators of the first author (WBL −0.71) but as those of the second (RP 0.60), where, again, the sign reversal indicates not that these two coauthors do not collaborate but that the links already established by the second eigenvector must be adjusted. Note one author (BB −0.09) has collaborated not with the second author but with the first, and so his eigenvector element has the same sign as the first author’s and the opposite of the second’s.

The next four eigenvectors all have large elements of the same sign. Each represents an important collaboration in our field.

The tenth eigenvector (David Andre) is again of mixed sign and represents an important readjustment of the collaborative links. Most of the large elements represent adjustments to links between John Koza’s coauthors and the addition of five of his coauthors who were not among the major components of the first eigenvector. Secondly, elements PN 0.22, WB −0.14 and HI −0.08 adjust links between coauthors in the third and seventh eigenvectors.

The eleventh eigenvector (Peter Nordin) can be thought of as similar to the fifth. It, too, contains a mixture of positive (PN) and negative (WB) values. The positive elements create links between PN and important collaborators, whilst the negative ones mostly create new links for WB. Other elements appear to make modest adjustments to existing members of the graph. Note some links are established by collaborations on coediting conference proceedings.

The twelfth eigenvector is the first one to contain many large elements (32). Many of these (which are all of the same sign) represent collaborative links established by Una-May O’Reilly through coediting Advances in Genetic Programming, EuroGP, GECCO and Genetic Programming Theory and Practice. Others represent collaborations on papers.

The next eigenvector has all positive large elements and introduces Jason Daida and his collaborators. The fourteenth summarises the collaborations of Maarten Keijzer (note the data are prior to the publication of GECCO 2006, which he edited). However, two opposite sign elements (CR −0.16 and JFM −0.11) again indicate adjustment to existing links. The fifteenth eigenvector also contains many large elements (18). These are mostly the collaborators of Lee Spector. Again, elements with the opposite sign suggest adjustment to the existing network (particularly for John Koza’s coauthors and those introduced by the previous eigenvector), while one element (JBP 0.05) represents the addition of a coauthor of a coeditor (PJA 0.10) of Lee.

Ten of the remaining 28 large eigen components have relatively few large elements and these are mostly of the same sign. They can be thought of as mostly describing an important group of collaborating authors. Other eigenvectors tend to have more large components with more evenly matched number of positive and negative elements. While they introduce new authors they also play an important role in adjusting strengths of existing links.

The two eigenvectors for Tom Hayes have similar eigenvalues (\(\approx 43\)) and are relatively diffuse (15 and 20 major elements) but 12 authors play major roles in both. Again positive and negative values appear, indicating these two eigen components play an important role in recording the asymmetry in the graph. Similarly the pair of eigenvectors for Nick McPhee (60 and 40 components) both have eigenvalues of about 27 and have 35 coauthors in common. The mix of signs indicates the formation of clusters with strong links within a cluster but no or weak links between strong centres.

4 Conclusions

Graphical tools, like neato, provide a valuable way to represent the coauthor links within the genetic programming bibliography. However, with large networks like this, one is rapidly swamped by details and it is difficult to annotate them meaningfully. Eigen analysis provides a ready way to extract quantitative signals from the data in a significant order.

As one studies Table 1 from the top, one quickly moves from anticipated relationships to surprises. These were discussed in Sect. 3.