Who Says What to Whom on Twitter
Shaomei Wu
Cornell University, USA
sw475@cornell.edu
Jake M. Hofman
Yahoo! Research, NY, USA
hofman@yahoo-inc.com
Winter A. Mason
Yahoo! Research, NY, USA
winteram@yahoo-inc.com
Duncan J. Watts
Yahoo! Research, NY, USA
djw@yahoo-inc.com
ABSTRACT
We study several longstanding questions in media communi-
cations research, in the context of the microblogging service
Twitter, regarding the production, flow, and consumption
of information. To do so, we exploit a recently introduced
feature of Twitter—known as Twitter lists—to distinguish
between elite users, by which we mean specifically celebri-
ties, bloggers, and representatives of media outlets and other
formal organizations, and ordinary users. Based on this clas-
sification, we find a striking concentration of attention on
Twitter—roughly 50% of tweets consumed are generated by
just 20K elite users—where the media produces the most in-
formation, but celebrities are the most followed. We also find
significant homophily within categories: celebrities listen to
celebrities, while bloggers listen to bloggers, etc.; however,
bloggers in general rebroadcast more information than the
other categories. Next we re-examine the classical “two-step
flow” theory of communications, finding considerable sup-
port for it on Twitter, but also some interesting differences.
Third, we find that URLs broadcast by different categories
of users or containing different types of content exhibit sys-
tematically different lifespans. And finally, we examine the
attention paid by the different user categories to different
news topics.
Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems;
J.4 [Social and Behavioral Sciences]: Sociology
General Terms
two-step flow, communications, classification
Keywords
Communication networks, Twitter, information flow
Part of this research was performed while the author was
visiting Yahoo! Research, New York.
WWW ’11 Hyderabad, India
1. INTRODUCTION
A longstanding objective of media communications re-
search is encapsulated by what is known as Lasswell’s maxim:
“who says what to whom in what channel with what ef-
fect” [9], so-named for one of the pioneers of the field, Harold
Lasswell. Although simple to state, Lasswell’s maxim has
proven difficult to satisfy in the more than 60 years since
he stated it, in part because it is generally difficult to ob-
serve information flows in large populations, and in part
because different channels have very different attributes and
effects. As a result, theories of communications have tended
to focus either on “mass” communication, defined as “one-
way message transmissions from one source to a large, rela-
tively undifferentiated and anonymous audience,” or on “in-
terpersonal” communication, meaning a “two-way message
exchange between two or more individuals.” [13].
Correspondingly, debates among communication theorists
have tended to revolve around the relative importance of
these two putative modes of communication. For example,
whereas early theories such as the so-called “hypodermic
model” posited that mass media exerted direct and relatively
strong effects on public opinion, mid-century researchers [10,
6, 11, 4] argued that the mass media influenced the pub-
lic only indirectly, via what they called a two-step flow of
communications, where the critical intermediate layer was
occupied by a category of media-savvy individuals called
opinion leaders. The resulting “limited effects” paradigm
was subsequently challenged by a new generation of
researchers [5], who claimed that the real importance of the
mass media lay in its ability to set the agenda of public
discourse. But in recent years rising public skepticism of
mass media, along with changes in media and communica-
tion technology, have tilted conventional academic wisdom
once more in favor of interpersonal communication, which
some identify as a “new era” of minimal effects [2].
Recent changes in technology, however, have increasingly
undermined the validity of the mass vs. interpersonal di-
chotomy itself. On the one hand, over the past few decades
mass communication has experienced a proliferation of new
channels, including cable television, satellite radio, special-
ist book and magazine publishers, and of course an array of
web-based media such as sponsored blogs, online communi-
ties, and social news sites. Correspondingly, the traditional
mass audience once associated with, say, network television
has fragmented into many smaller audiences, each of which
increasingly selects the information to which it is exposed,
and in some cases generates the information itself. Mean-
while, in the opposite direction interpersonal communication
has become increasingly amplified through personal blogs,
email lists, and social networking sites to afford individu-
als ever-larger audiences. Together, these two trends have
greatly obscured the historical distinction between mass and
interpersonal communications, leading some scholars to refer
instead to “masspersonal” communications [13].
Nowhere is the erosion of traditional categories more ap-
parent than in the micro-blogging platform Twitter. To il-
lustrate, the top ten most-followed users on Twitter are not
corporations or media organizations, but individual people,
mostly celebrities. Moreover, these individuals communicate
directly with their millions of followers through accounts that
are often managed by themselves or their publicists, thus
bypassing the traditional
intermediation of the mass media between celebrities and
fans. Next, in addition to conventional celebrities, a new
class of “semi-public” individuals like bloggers, authors,
journalists, and subject matter experts has come to occupy an
important niche on Twitter, in some cases becoming more
prominent than traditional public figures such as entertain-
ers and elected officials. Third, in spite of these shifts away
from centralized media power, media organizations—along
with corporations, governments, and NGOs—all remain well
represented among highly followed users, and are often ex-
tremely active. And finally, Twitter is primarily made up
of many millions of users who seem to be ordinary individ-
uals communicating with their friends and acquaintances in
a manner largely consistent with traditional notions of in-
terpersonal communication.
Twitter, therefore, represents the full spectrum of communications
from personal and private to “masspersonal” to traditional
mass media. Consequently it provides an interesting
context in which to address Lasswell’s maxim, especially as
Twitter—unlike television, radio, and print media—enables
one to easily observe information flows among the members
of its ecosystem. Unfortunately, however, the kinds of ef-
fects that are of most interest to communications theorists,
such as changes in behavior, attitudes, etc., remain difficult
to measure on Twitter. Therefore in this paper we limit
our focus to the “who says what to whom” part of Lasswell’s
maxim.
To this end, our paper makes three main contributions:
• We introduce a method for classifying users using Twitter
  Lists into “elite” and “ordinary” users, further classifying
  elite users into one of four categories of interest:
  media, celebrities, organizations, and bloggers.
• We investigate the flow of information among these
  categories, finding that although audience attention is
  highly concentrated on a minority of elite users, much
  of the information they produce reaches the masses
  indirectly via a large population of intermediaries.
• We find that different categories of users place slightly
  different emphasis on different types of content, and
  that different content types exhibit dramatically different
  characteristic lifespans, ranging from less than
  a day to months.
The remainder of the paper proceeds as follows. In the
next section, we review related work. In section 3 we dis-
cuss our data and methods, including section 3.3 in which
we describe how we use Twitter Lists to classify users, out-
line two different sampling methods, and show that they
deliver qualitatively similar results. In section 4 we analyze
the production of information on Twitter, particularly who
pays attention to whom. In section 4.1, we revisit the the-
ory of the two-step flow—arguably the dominant theory of
communications for much of the past 50 years—finding con-
siderable support for the theory as well as some interesting
differences. In section 5, we consider “who listens to what”,
examining first who shares what kinds of media content, and
second the lifespan of URLs as a function of their origin and
their content. Finally, in section 6 we conclude with a brief
discussion of future work.
2. RELATED WORK
Aside from the communications literature surveyed above,
a number of recent papers have examined information dif-
fusion on Twitter. Kwak et al. [8] studied the topological
features of the Twitter follower graph, concluding from the
highly skewed nature of the distribution of followers and the
low rate of reciprocated ties that Twitter more closely resem-
bled an information sharing network than a social network—
a conclusion that is consistent with our own view. In ad-
dition, Kwak et al. compared three different measures of
influence—number of followers, page-rank, and number of
retweets—finding that the ranking of the most influential
users differed depending on the measure. In a similar vein,
Cha et al. [3] compared three measures of influence—number
of followers, number of retweets, and number of mentions—
and also found that the most followed users did not neces-
sarily score highest on the other measures. Weng et al. [15]
compared number of followers and page rank with a modified
page-rank measure which accounted for topic, again finding
that ranking depended on the influence measure. Finally,
Bakshy et al. [1] studied the distribution of retweet cascades
on Twitter, finding that although users with large follower
counts and past success in triggering cascades were on aver-
age more likely to trigger large cascades in the future, these
features are in general poor predictors of future cascade size.
Our paper differs from this earlier work by shifting atten-
tion from the ranking of individual users in terms of various
influence measures to the flow of information among dif-
ferent categories of users. In particular, we are interested
in identifying “elite” users, who we differentiate from “ordi-
nary” users in terms of their visibility, and understanding
their role in introducing information into Twitter, as well as
how information originating from traditional media sources
reaches the masses.
3. DATA AND METHODS
3.1 Twitter Follower Graph
In order to understand how information is transmitted on
Twitter, we need to know the channels by which it flows;
that is, who is following whom on Twitter. To this end, we
used the follower graph studied by Kwak et al. [8], which
included 42M users and 1.5B edges. This data represents
a crawl of the graph seeded with all users on Twitter as
observed by July 31st, 2009, and is publicly available (the data is
free to download from http://an.kaist.ac.kr/traces/WWW2010.html). As
reported by Kwak et al. [8], the follower graph is a directed
network characterized by highly skewed distributions both
of in-degree (# followers) and out-degree (#“friends”, Twit-
ter notation for how many others a user follows); however,
the in-degree distribution is even more skewed than the
out-degree distribution. In both friend and follower distributions,
for example, the median is less than 100, but the maximum
# friends is several hundred thousand, while a small
number of users have millions of followers. In addition, the
follower graph is also characterized by extremely low reci-
procity (roughly 20%)—in particular, the most-followed in-
dividuals typically do not follow many others. The Twitter
follower graph, in other words, does not conform to the usual
characteristics of social networks, which exhibit much higher
reciprocity and far less skewed degree distributions [7], but
instead resembles more the mixture of one-way mass com-
munications and reciprocated interpersonal communications
described above.
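The two graph properties discussed above, reciprocity and degree skew, can be illustrated with a minimal sketch. It assumes the follower graph is available as a list of directed (follower, followed) pairs and uses one common definition of reciprocity (the fraction of edges whose reverse edge also exists), which may differ from the definition used by Kwak et al. [8]. The function names and the toy graph are hypothetical.

```python
from statistics import median

def reciprocity(edges):
    """Fraction of directed follow edges (u, v) whose reverse (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

def degree_summary(edges):
    """Return ((median, max) of in-degree, (median, max) of out-degree)."""
    followers, friends = {}, {}
    for u, v in set(edges):  # u follows v
        friends[u] = friends.get(u, 0) + 1
        followers[v] = followers.get(v, 0) + 1
    users = set(followers) | set(friends)
    ins = [followers.get(x, 0) for x in users]
    outs = [friends.get(x, 0) for x in users]
    return (median(ins), max(ins)), (median(outs), max(outs))

# Toy graph (hypothetical): three fans follow one star; one pair follows each other.
toy_edges = [("a", "star"), ("b", "star"), ("c", "star"), ("a", "b"), ("b", "a")]
toy_reciprocity = reciprocity(toy_edges)
toy_degrees = degree_summary(toy_edges)
```

A large gap between the median and the maximum of a degree list is the kind of skew described in the text; on the real graph the computation would need to be streamed rather than held in memory.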
3.2 Twitter Firehose
In addition to the follower graph, we are interested in the
content being shared on Twitter—particularly URLs—and
so we examined the corpus of all 5B tweets generated over
a 223-day period from July 28, 2009 to March 8, 2010 using
data from the Twitter “firehose,” the complete stream
of all tweets.² Because our objective is to understand the
flow of information, it is useful for us to restrict attention to
tweets containing URLs, for two reasons. First, URLs add
easily identifiable tags to individual tweets, allowing us to
observe when a particular piece of content is either retweeted
or subsequently reintroduced by another user. And second,
because URLs point to online content outside of Twitter,
they provide a much richer source of variation than is pos-
sible in the typical 140 character tweet. Finally, we note
that almost all URLs broadcast on Twitter have been short-
ened using one of a number of URL shorteners, of which the
most popular is http://bit.ly/. From the total of 5B tweets
recorded during our observation period, therefore, we focus
our attention on the subset of 260M containing bit.ly URLs.
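As an illustration of the filtering step, the sketch below keeps only tweets whose text contains a bit.ly URL and extracts the short code, which can then serve as the tag for tracking a piece of content. The regular expression, the function names, and the (user, text) tweet representation are assumptions for illustration, not the pipeline actually used.

```python
import re

# Matches bit.ly short links with or without the scheme.
# The character set allowed in the short code is an assumption.
BITLY_RE = re.compile(r"(?:https?://)?bit\.ly/([A-Za-z0-9_-]+)")

def bitly_hashes(text):
    """Return the bit.ly short codes mentioned in a tweet's text."""
    return BITLY_RE.findall(text)

def tweets_with_bitly(tweets):
    """Keep only (user, text) tweets that contain at least one bit.ly URL."""
    return [(user, text) for user, text in tweets if BITLY_RE.search(text)]

# Hypothetical sample tweets.
sample = [
    ("alice", "Great read http://bit.ly/abc123"),
    ("bob", "no links here"),
    ("carol", "RT @alice: Great read http://bit.ly/abc123 and bit.ly/xYz-9"),
]
kept = tweets_with_bitly(sample)
```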
3.3 Twitter Lists
Our method for classifying users exploits a relatively re-
cent feature of Twitter: Twitter Lists. Since its launch on
November 2, 2009, Twitter Lists have been welcomed by the
community as a way to group people and organize one’s in-
coming stream of tweets by specific sets of users. To create
a Twitter List, a user needs to provide a name (required)
and description (optional) for the list, and decide whether
the new list is public (anyone can view and subscribe to this
list) or private (only the list creator can view or subscribe to
this list). Once a list is created, the user can add/edit/delete
list members. As the purpose of Twitter Lists is to help users
organize users they follow, the name of the list can be con-
sidered a meaningful label for the listed users. List creation
therefore effectively applies the “wisdom of crowds” [12]
to the task of classifying users, both in terms of their im-
portance to the community (number of lists on which they
appear), and also how they are perceived (e.g. news organi-
zation vs. celebrity, etc.).
² http://dev.twitter.com/doc/get/statuses/firehose
Before describing our methods for classifying users in terms
of the lists on which they appear, we emphasize that we
are motivated by a particular set of substantive questions
arising out of communications theory. In particular, we
are interested in the relative importance of mass communications,
as practiced by media and other formal organizations,
masspersonal communications, as practiced by celebrities
and prominent bloggers, and interpersonal communications,
as practiced by ordinary individuals communicating
with their friends. In addition, we are also interested in the
relationships between these categories of users, motivated
by theoretical arguments such as the theory of the two-step
flow [6]. Rather than pursuing a strategy of automatic clas-
sification, therefore, our approach depends on defining and
identifying certain predetermined classes of theoretical in-
terest, where both approaches have advantages and disad-
vantages. In particular, we restrict our attention to four
classes of what we call “elite” users: media, celebrities, orga-
nizations, and bloggers, as well as the relationships between
these elite users and the much larger population of “ordi-
nary” users.
In addition to these theoretically imposed constraints,
our proposed classification method must also satisfy a practical
constraint: the rate limits established by
Twitter’s API effectively preclude crawling all lists for all
Twitter users.³ Thus we instead devised two different sampling
schemes, a snowball sample and an activity sample,
each with some advantages and disadvantages, discussed below.
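The rate-limit constraint can be made concrete with a back-of-envelope calculation. The sketch below uses the figures given in the paper's footnote on the API (20K calls per hour, at most 20 lists per call, roughly 40M users, at most 20 lists per user); the function name and the assumption of one sequential account are ours.

```python
import math

def crawl_hours(n_users, lists_per_user, calls_per_hour=20_000, lists_per_call=20):
    """Hours needed to fetch every user's list memberships under the stated limits."""
    calls_per_user = math.ceil(lists_per_user / lists_per_call)
    return n_users * calls_per_user / calls_per_hour

# Lower bound from the footnote: 40M users, each on at most 20 lists.
hours = crawl_hours(n_users=40_000_000, lists_per_user=20)
weeks = hours / (24 * 7)

# A single heavily listed account (about 140,000 lists) needs thousands of calls on its own.
calls_for_heavy_user = math.ceil(140_000 / 20)
```

Under these assumptions the full crawl takes 2,000 hours with a single account, which is why sampling is needed.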
3.3.1 Snowball sample of Twitter Lists
The first method for identifying elite users employed snowball
sampling. For each category, we chose a number $u_0$ of
seed users that were highly representative of the desired category
and appeared on many category-related lists. For each
of the four categories above, the following seeds were chosen:
• Celebrities: Barack Obama, Lady Gaga, Paris Hilton
• Media: CNN, New York Times
• Organizations: Amnesty International, World Wildlife
  Foundation, Yahoo! Inc., Whole Foods
• Blogs⁴: BoingBoing, FamousBloggers, problogger, mashable,
  Chrisbrogan, virtuosoblogger, Gizmodo, Ileane,
  dragonblogger, bbrian017, hishaman, copyblogger, engadget,
  danielscocco, BlazingMinds, bloggersblog, TycoonBlogger,
  shoemoney, wchingya, extremejohn, GrowMap, kikolani,
  smartbloggerz, Element321, brandonacox, remarkablogger,
  jsinkeywest, seosmarty, NotAProBlog, kbloemendaal,
  JimiJones, ditesco
After reviewing the lists associated with these seeds, the
following keywords were hand-selected based on (a) their
representativeness of the desired categories; and (b) their
lack of overlap between categories:
• Celebrities: star, stars, hollywood, celebs, celebrity,
  celebrities, celebsverified, celebrity-list,
  celebrities-on-twitter, celebrity-tweets
• Media: news, media, news-media
• Organizations: company, companies, organization,
  organisation, organizations, organisations, corporation,
  brands, products, charity, charities, causes, cause, ngo
• Blogs: blog, blogs, blogger, bloggers
[Figure 1 schematic: seed users $u_0$, lists $l_0$, users $u_1$, lists $l_1$, users $u_2$, lists $l_2$]
Figure 1: Schematic of the Snowball Sampling Method
³ The Twitter API allows only 20K calls per hour, where at
most 20 lists can be retrieved for each API call. Under the
modest assumption of 40M users (roughly the number in the
2009 crawl by [8]), where each user is included on at most
20 lists, this would require $(4 \times 10^7)/(2 \times 10^4) = 2{,}000$
hours, or nearly 12 weeks. Clearly this time could be reduced by
deploying multiple accounts, but it also likely underestimates the
real time quite significantly, as many users appear on many more
than 20 lists (e.g. Lady Gaga appears on nearly 140,000).
⁴ The blogger category required many more seeds because
bloggers are in general lower profile than the seeds for the
other categories.
Having selected the seeds and the keywords for each category,
we then performed a snowball sample of the bipartite
graph of users and lists (see Figure 1). For each seed, we
crawled all lists on which that seed appeared. The resulting
“list of lists” was then pruned to contain only the $l_0$ lists
whose names matched at least one of the chosen keywords
for that category. For instance, Lady Gaga is on lists called
“faves”, “celebs”, and “celebrity”, but only the latter two lists
would be kept after pruning. We then crawled all $u_1$ users
appearing in the pruned “list of lists” (for instance, finding
all users that appeared in the “celebrity” list with Lady
Gaga), and then repeated these last two steps to complete
the crawl. In total, 524,116 users were obtained, who appeared
on 7,000,000 lists; however, many of the more prominent
users appeared on lists in more than one category. For
example, Oprah Winfrey is frequently included in lists of
“celebrity” as well as “media.” To resolve this ambiguity, we
computed a user $i$’s membership score in category $c$:
$$w_{ic} = \frac{n_{ic}}{N_c},$$
where $n_{ic}$ is the number of lists in category $c$ that contain
user $i$ and $N_c$ is the total number of lists in category $c$.
We then assigned each user to the category in which he
or she has the highest membership score. The number of
users assigned in this manner to each category is reported
in Table 1.
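The crawl and the membership score described above can be sketched as follows. This is an illustrative in-memory version, not the crawler actually used: lists are represented as hypothetical (owner, name) pairs, a list is kept only when its name exactly equals a keyword (the text says only that names "matched" a keyword), and the number of rounds is a parameter because the stopping point of the crawl is not fully specified.

```python
from collections import defaultdict

def snowball(seeds, lists_of_user, members_of_list, keywords, rounds=2):
    """Alternate users -> lists -> users, keeping only keyword-matching lists."""
    users, kept = set(seeds), set()
    frontier = set(seeds)
    for _ in range(rounds):
        # Lists containing the frontier users, pruned by keyword match on the list name.
        new_lists = {
            lst
            for user in frontier
            for lst in lists_of_user.get(user, ())
            if lst[1] in keywords and lst not in kept
        }
        kept |= new_lists
        # Users appearing on the newly kept lists become the next frontier.
        frontier = {m for lst in new_lists for m in members_of_list.get(lst, ())} - users
        users |= frontier
    return users, kept

def assign_categories(lists_by_category, members_of_list):
    """Assign each user to the category c with the highest w_ic = n_ic / N_c."""
    scores = defaultdict(dict)
    for c, lists in lists_by_category.items():
        n_c = len(lists)
        counts = defaultdict(int)
        for lst in lists:
            for user in set(members_of_list.get(lst, ())):
                counts[user] += 1
        for user, n_ic in counts.items():
            scores[user][c] = n_ic / n_c
    return {user: max(by_cat, key=by_cat.get) for user, by_cat in scores.items()}

# Toy data (hypothetical list owners and memberships).
L1, L2, L3 = ("x", "celebs"), ("y", "faves"), ("z", "celebrity")
L4, L5 = ("w", "news"), ("v", "media")
members_of_list = {
    L1: ["ladygaga", "Oprah"], L2: ["ladygaga", "bob"], L3: ["Oprah", "taylorswift13"],
    L4: ["Oprah", "cnnbrk"], L5: ["cnnbrk"],
}
lists_of_user = defaultdict(list)
for lst, members in members_of_list.items():
    for m in members:
        lists_of_user[m].append(lst)

celeb_users, celeb_lists = snowball(
    ["ladygaga"], lists_of_user, members_of_list, {"celebs", "celebrity"}
)
labels = assign_categories({"celeb": celeb_lists, "media": {L4, L5}}, members_of_list)
```

In the toy data Oprah appears on both celebrity and media lists, and the normalization by $N_c$ decides her category. Ties are broken arbitrarily in this sketch.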
3.3.2 Activity Sample of Twitter Lists
Although the snowball sampling method is convenient and
is easily interpretable with respect to our theoretical moti-
vation, it is also potentially biased by our particular choice
of seeds. To address this concern, we also generate a sample
of users based on their activity. Specifically, we crawl all
lists associated with all users who tweet at least once every
week for our entire observation period.
This “activity-based” sampling method is also clearly biased
towards users who are consistently active. Importantly,
however, the bias is likely to be quite different from any
introduced by the snowball sample; thus obtaining similar
results from the two samples should give us confidence that our
findings are not artifacts of the sampling procedure. This
method initially yielded 750k users and 5M lists; however,
after pruning the lists to those that contained at least one of the
keywords above, and assigning users to unique categories
(as described above), we obtained a much-reduced sample
of 113,685 users, where Table 1 reports the number of users
assigned to each category. We note that the number of lists
obtained by the activity sampling method is considerably
smaller than that obtained by the snowball sample, and
that bloggers are more heavily represented among the activity
sample at the expense of the other three categories,
consistent with our claim that the two methods introduce
different biases. Interestingly, however, 97,614 of the activity
sample, or 85%, also appear in the snowball sample,
suggesting that the two sampling methods identify similar
populations of elite users, as indeed we confirm in the
next section.
Table 1: Distribution of users over categories
            Snowball Sample            Activity Sample
category    # of users   % of users    # of users   % of users
celeb       82,770       15.8%         14,778       13.0%
media       216,010      41.2%         40,186       35.3%
org         97,853       18.7%         14,891       13.1%
blog        127,483      24.3%         43,830       38.6%
total       524,116      100%          113,685      100%
3.3.3 Classifying Elite Users
In order to identify categories of elite users, we not only
need to classify users into categories, but also arrive at a def-
inition of “elite” that satisfies a tradeoff between (a) keeping
each category relatively small, so as not to include users who
are not distinguishable from ordinary users, while (b) maxi-
mizing the volume of attention that is accounted for by each
category. In addition, it is also desirable to make the four
categories the same size, so as to facilitate comparisons. To
this end, we first rank all users in each category by how
frequently they are listed in that category. Next, we mea-
sure the flow of information from the top k users in each
of the four categories to a random sample of 100K ordinary
(i.e. unclassified) users in two ways: the proportion of peo-
ple the user follows in each category, and the proportion of
tweets the user received from everyone the user follows in
each category.
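The two attention measures can be sketched for a single category as follows. The sketch assumes that every tweet posted by a followed account counts as one tweet received, and it skips sampled users who follow no one or receive no tweets; both choices, along with all names and the toy data, are assumptions rather than details given in the text.

```python
def attention_shares(sample_users, friends_of, tweets_by, ranked_elites, k):
    """Average share of followees, and of tweets received, due to the top-k elites.

    friends_of:    dict user -> set of accounts the user follows
    tweets_by:     dict account -> number of tweets posted in the observation window
    ranked_elites: accounts of one category, sorted by how often they are listed
    """
    top_k = set(ranked_elites[:k])
    friend_shares, tweet_shares = [], []
    for user in sample_users:
        friends = friends_of.get(user, set())
        received = sum(tweets_by.get(f, 0) for f in friends)
        if not friends or received == 0:
            continue  # assumption: such users carry no information about attention
        elite_friends = friends & top_k
        friend_shares.append(len(elite_friends) / len(friends))
        tweet_shares.append(sum(tweets_by.get(f, 0) for f in elite_friends) / received)
    n = len(friend_shares)
    return (sum(friend_shares) / n, sum(tweet_shares) / n) if n else (0.0, 0.0)

# Toy data (hypothetical): two ordinary users and a ranked media category.
friends_of = {"u1": {"cnn", "bob"}, "u2": {"cnn", "nyt", "amy", "bob"}}
tweets_by = {"cnn": 30, "nyt": 10, "bob": 10, "amy": 10}
ranked_media = ["cnn", "nyt"]
shares_top1 = attention_shares(["u1", "u2"], friends_of, tweets_by, ranked_media, k=1)
```

Evaluating the function over increasing k traces out a curve of the kind used to choose the cut-off.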
Figure 2(a) shows for the snowball sample the share of
following links (square symbols) and tweets received (dia-
monds) by an average user, while Figure 2(b) shows the
same information for the activity sample. Although the nu-
merical values differ slightly, the two sets of results are qual-
itatively similar. In particular, for both sampling methods,
celebrities outrank all other categories, followed by the me-
dia, organizations, and bloggers. Also in both cases, the
bulk of the attention is accounted for by a relatively small
number of users within each category, as evidenced by the
relatively flat slope of the attention curves, where we note
that the curve for celebrities asymptotes more slowly than
for the other three categories. Balancing the requirements
described above, therefore, we chose k = 5000 as a cut-off
for the elite categories, where all remaining users are henceforth
classified as ordinary. In addition, from this point on,
we restrict our analysis of elite categories to the top 5,000
users identified by the snowball sampling method, noting that
both methods generate similar results.
[Figure 2 plots: panels (a) Snowball sample and (b) Activity sample; each
panel contains four plots (celebrities, media, organizations, blogs) of
average % (0 to 30) against top k (1,000 to 10,000), with one curve for
friends and one for tweets received]
Figure 2: Average fraction of # following (blue line)
and # tweets (red line) for a random user that are
accounted for by the top k elite users crawled
Based on this definition of elite users, Table 2 shows that
although ordinary users collectively introduce by far the
highest number of URLs, members of the elite categories are
far more active on a per-capita basis. In particular, users
classified as “media” easily outproduce all other categories,
followed by bloggers, organizations, and celebrities. Ordinary
users originate on average only about 6 URLs each,
compared with over 1,000 for media users. In the rest of
this paper, therefore, when we talk about “celebrity”, “media”,
“organization”, and “blog”, we refer to the top 5K users drawn
from the snowball sample listed as “celebrity”, “media”,
“organization”, and “blog”, respectively.
Table 2: # of URLs initiated by category
category    # of URLs      # of URLs per-capita
celeb       139,058        27.81
media       5,119,739      1023.94
org         523,698        104.74
blog        1,360,131      272.03
ordinary    244,228,364    6.10
Table 3, which shows the top 5 users in each of the four
categories, suggests that the sampling method yields results
that are consistent with our objective of identifying
users who are prominent exemplars of our target categories.
Among the celebrity list, for example, “aplusk” is the handle
for actor Ashton Kutcher, one of the first celebrities to
embrace Twitter and still one of the most followed users,
while the remaining celebrity users (Lady Gaga, Ellen
DeGeneres, Oprah Winfrey, and Taylor Swift) are all household
names. In the media category, CNN Breaking News and the
New York Times are most prominent, followed by Breaking
News, Time, and Asahi, a leading Japanese daily newspaper.
Among organizations, Google, Starbucks, and Twitter
are obviously large and socially prominent corporations,
while JoinRed is the charity organization started by Bono of
U2, and ollehkt is the Twitter account for KT, formerly Korea
Telecom. Finally, among the blogging category, Mashable
and ProBlogger are both prominent US blogging sites,
while Kibe Loco and Nao Salvo are popular blogs in Brazil,
and dooce is the blog of Heather Armstrong, a widely read
“mommy blogger” with over 1.5M followers.
Table 3: Top 5 users in each category
Celebrity        Media           Org          Blog
aplusk           cnnbrk          google       mashable
ladygaga         nytimes         Starbucks    problogger
TheEllenShow     asahi           twitter      kibeloco
taylorswift13    BreakingNews    joinred      naosalvo
Oprah            TIME            ollehkt      dooce
4. “WHO LISTENS TO WHOM”
The results of the previous section provide qualified sup-
port for the conventional wisdom that audiences have be-
come increasingly fragmented. Clearly, ordinary users on
Twitter are receiving their information from many thou-
sands of distinct sources, most of which are not traditional
media organizations—even though media outlets are by far
the most active users on Twitter, only about 15% of tweets
received by ordinary users are received directly from the
media. Equally interesting, however, is that in spite of this
fragmentation, it remains the case that 20K elite users, comprising
less than 0.05% of the user population, attract almost
50% of all attention within Twitter. Even if the media
has lost attention relative to other elites, information flows
have not become egalitarian by any means.
The prominence of elite users also raises the question of
how these different categories listen to each other. To ad-
dress this issue, we compute the volume of tweets exchanged
between elite categories. Specifically, Figure 3 shows the
average percentage of tweets that category i receives from
category j, exhibiting striking homophily with respect to
attention: celebrities overwhelmingly pay attention to other
celebrities, media actors pay attention to other media actors,
and so on. The one slight exception to this rule is that
organizations pay more attention to bloggers than to themselves.
In general, in fact, attention paid by organizations is
more evenly distributed across categories than for any other
category.
[Figure 3 diagram: nodes Celeb, Media, Org, Blog (category of Twitter
users); an edge from A to B indicates that B receives tweets from A]
Figure 3: Share of tweets received among elite categories
Figure 4: RT behavior among elite categories
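The category-to-category shares of tweets received can be sketched as a small matrix computation. The sketch assumes that the percentage for each user is taken over all tweets received from every followed account, with accounts outside the four elite categories labeled "ordinary"; whether the denominator in the actual analysis includes ordinary accounts is not stated, so this is one possible reading. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def attention_matrix(category_of, friends_of, tweets_by):
    """matrix[i][j] = average % of tweets a user in category i receives from category j."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for user, i in category_of.items():
        received = defaultdict(int)
        for f in friends_of.get(user, ()):
            received[category_of.get(f, "ordinary")] += tweets_by.get(f, 0)
        total = sum(received.values())
        if total == 0:
            continue  # assumption: users receiving no tweets are skipped
        counts[i] += 1
        for j, n in received.items():
            sums[i][j] += 100.0 * n / total
    return {i: {j: s / counts[i] for j, s in row.items()} for i, row in sums.items()}

# Toy data (hypothetical follow relations and tweet counts).
category_of = {"gaga": "celeb", "oprah": "celeb", "cnn": "media", "nyt": "media"}
friends_of = {"gaga": {"oprah", "cnn"}, "oprah": {"gaga"}, "cnn": {"nyt"}, "nyt": {"cnn", "gaga"}}
tweets_by = {"gaga": 10, "oprah": 30, "cnn": 10, "nyt": 40}
matrix = attention_matrix(category_of, friends_of, tweets_by)
```

Homophily shows up as large diagonal entries, as in the toy matrix where each category receives most of its tweets from itself.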
Figure 3, it should be noted, shows only how many URLs
are received by category i from category j, a particularly weak
measure of attention for the simple reason that many tweets
go unread. A stronger measure of attention, therefore, is
to consider instead only those URLs introduced by category
i that are subsequently retweeted by category j. Figure 4
shows how much information originating from each category
is retweeted by other categories. As with our previous mea-
sure of attention, retweeting is strongly homophilous among
elite categories; however, bloggers are disproportionately re-
sponsible for retweeting URLs originated by all categories.
This result reflects the characterization of bloggers as recy-
clers and filters of information. However, even though on a
per-capita basis bloggers disproportionately occupy the role
of information recyclers—93 retweets per person, compared
to only 1.1 retweets per person for ordinary users—the total
number of URLs retweeted by bloggers (465k) is vastly out-
weighed by the number retweeted by ordinary users (46M);
thus their overall impact is relatively minimal.
4.1 Two-Step Flow of Information
Examining information flow on Twitter can also shed new
light on the theory of the two-step flow, arguably the theory
that has most successfully captured the dueling importance
of mass media and interpersonal influence. The essence of
the two-step flow is that information passes from the media
to the masses not directly, as supposed by early theories of
mass communication, but passes first through an intermedi-
ate layer of “opinion leaders” who decide which information
to rebroadcast to their followers, and which to ignore. As
we have already noted, on Twitter the flow of information
to the masses from the media accounts for only a fraction
of the total volume of information. Nevertheless, it is still a
substantial fraction, so it is still interesting to ask: for the
special case of information originating from media sources,
what proportion is broadcast directly to the masses, and
what proportion is transmitted indirectly via some popula-
tion of intermediaries? In addition, we may inquire whether
these intermediaries, to the extent they exist, are drawn
from other elite categories or from ordinary users, as claimed
by the two-step flow theory; and if the latter, in what re-
spects they differ from other ordinary users.
Before proceeding with this analysis, we note that there
are two ways information can pass through an intermedi-
ary in Twitter. The first is via retweeting, which occurs
when a user explicitly rebroadcasts a URL that he or she
has received, along with an explicit acknowledgement of the
source, either using the official retweet function provided by
Twitter, or making use of an informal convention such as
“RT @user” or “via @user.” The second mechanism is what
we label reintroduction, where a user subsequently tweets a
URL that has previously been introduced by another user,
but without the acknowledgment, in which case we assume
the information has been rediscovered independently. For
the purposes of studying when a user receives information
directly from the media or indirectly through an intermedi-
ary, we treat retweets and reintroductions equivalently. If
the first occurrence of a URL in Twitter came from a media
user, but a user received the URL from another source, then
that source can be considered an intermediary, whether they
are citing the source within Twitter by retweeting the URL,
or reintroducing it, having discovered the URL outside of
Twitter.
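As an illustration of the distinction drawn above, the following minimal Python sketch labels a tweet that repeats an already-introduced URL as either a retweet or a reintroduction. The regular expression, the `is_official_retweet` flag, and the function name are assumptions made for this example; the text does not specify how acknowledgments were actually detected.

```python
import re

# Informal acknowledgment conventions named in the text: "RT @user", "via @user".
# The exact pattern is an assumption made for illustration.
ACK_PATTERN = re.compile(r"\b(?:RT|via)\s*@\w+", re.IGNORECASE)


def classify_repeat(text, is_official_retweet=False):
    """Label a tweet that repeats a URL someone else already introduced.

    Returns "retweet" when the source is acknowledged (official retweet
    function or an informal RT/via convention), else "reintroduction".
    """
    if is_official_retweet or ACK_PATTERN.search(text):
        return "retweet"
    return "reintroduction"
```

For the direct-versus-indirect analysis both labels are treated the same way; the distinction only matters when retweets and reintroductions are counted separately.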
To quantify the extent to which ordinary users get their
information indirectly versus directly from the media, we
sampled 1M random ordinary users (see footnote 5) and, for
each user, counted the number n of bit.ly URLs they had
received that had originated from one of our 5K media users.
Of the 1M total, 600K had received at least one such URL.
For each member of this 600K subset we then counted the
number n_2 of these URLs that they received via non-media
friends; that is, via a two-step flow. The average fraction
n_2/n = 0.46 therefore represents the proportion of
media-originated content that reaches the masses via an
intermediary rather than directly. As Figure 5 shows, however,
this average is somewhat misleading. In reality, the pop-
ulation comprises two types—those who receive essentially
all of their media-originating information via two-step flows
and those who receive virtually all of it directly from the
media. Unsurprisingly, the former type is exposed to less
total media than the latter. What is surprising, however, is
that even users who received up to 100 media URLs during
our observation period received all of them from opinion
leaders.

Footnote 5: As before, performing this analysis for the
entire population of over 40M ordinary users proved to be
computationally unfeasible.

Figure 5: Percentage of information that is received
via an intermediary as a function of total volume of
media content to which a user is exposed. (Panels a, b, c, d.)
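The per-user fraction of media-originated URLs received via an intermediary can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data structures (`friends`, `posters_by_url`, `media_users`) are hypothetical, and a URL is counted as two-step only when none of the user's media friends posted it, which is one possible reading of "received via non-media friends."

```python
def two_step_fraction(friends, posters_by_url, media_users):
    """Return n_2 / n for one user, or None if the user received no media URL.

    friends        -- set of accounts the user follows
    posters_by_url -- dict: media-originated URL -> set of accounts that posted it
    media_users    -- set of accounts classified as media
    """
    n = 0    # media-originated URLs the user received at all
    n_2 = 0  # of those, URLs that arrived only through non-media friends (assumption)
    for posters in posters_by_url.values():
        senders = posters & friends
        if not senders:
            continue  # the user never received this URL
        n += 1
        if not (senders & media_users):
            n_2 += 1
    return n_2 / n if n else None


def average_fraction(users, posters_by_url, media_users):
    """Average n_2 / n over users who received at least one media URL."""
    fractions = []
    for friends in users.values():
        f = two_step_fraction(friends, posters_by_url, media_users)
        if f is not None:
            fractions.append(f)
    return sum(fractions) / len(fractions) if fractions else None
```

Because the average is taken over per-user fractions, a population split between users near 0 and users near 1 can still produce a value close to one half.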
Who are these intermediaries, and how many of them are
there? In total, the population of intermediaries is smaller
than that of the users who rely on them, but still surprisingly
large, roughly 500K, the vast majority of whom (96%) are
classified as ordinary users, not elites. Interestingly, Figure 5c
also shows that at least some intermediaries also receive the
bulk of their media content indirectly, just like other ordi-
nary users. Comparing Figure 5a and 5c, however, we note
that intermediaries are not like other ordinary users in that
they are exposed to considerably more media than randomly
selected users, hence the number of intermediaries who rely
on two-step flows is much smaller than for random users. In
addition, we find that on average intermediaries have more
followers than randomly sampled users (543 followers versus
34) and are also more active (180 tweets on average, versus
7). Finally, Figure 6 shows that although all intermediaries,
by definition, pass along media content to at least one other
user, only a minority serve this function for multiple users.
The most prominent intermediaries are disproportionately
drawn from the 4% elite users: Ashton Kutcher (aplusk), for
example, acts as an intermediary for over 100,000 users.
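The counts behind such a frequency plot can be sketched in a few lines of Python. The input format, a list of (intermediary, recipient) pairs, and both function names are assumptions made for illustration; the text does not describe the actual data layout or binning.

```python
from collections import Counter


def recipients_per_intermediary(deliveries):
    """deliveries: iterable of (intermediary, recipient) pairs.

    Returns {intermediary: number of distinct recipients served}.
    """
    served = {}
    for intermediary, recipient in deliveries:
        served.setdefault(intermediary, set()).add(recipient)
    return {who: len(recipients) for who, recipients in served.items()}


def frequency_of_counts(per_intermediary):
    """Return {recipient count: number of intermediaries with that count}."""
    return Counter(per_intermediary.values())
```

Plotting the result on logarithmic axes would show how quickly the number of intermediaries falls off as the number of recipients grows.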
Interestingly, these results are all broadly consistent with
the original conception of the two-step flow, advanced over
50 years ago, which emphasized that opinion leaders were
“distributed in all occupational groups, and on every social
and economic level” [6], corresponding to our classification of
most intermediaries as ordinary. The original theory
also emphasized that opinion leaders, like their followers,
received at least some of their information via two-step
flows, but that in general they were more exposed to the
media than their followers—just as we find here. Finally,
the theory predicted that opinion leadership was not a bi-
nary attribute, but rather a continuously varying one, cor-
responding to our finding that intermediaries vary widely in
the number of users for whom they act as filters and
transmitters of media content. Given the length of time that has
elapsed since the theory of the two-step flow was articulated,
and the transformational changes that have taken place in
communications technology in the interim (given, in fact,
that a service like Twitter was likely unimaginable at the
time), it is remarkable how well the theory agrees with our
observations.

Figure 6: Frequency of intermediaries binned by #
randomly sampled users to whom they transmit media
content. (Axes: # of two-step recipients; # of opinion leaders.)
5. WHO LISTENS TO WHAT?
The results in section 4 demonstrate that “elite” users
account for a substantial portion of all of the attention on
Twitter, but also show clear differences in how the attention
is allocated to the different elite categories. It is therefore
interesting to consider what kinds of content are being shared
by these categories. Given the large number of URLs in our
observation period (260M), and the many different ways one
can classify content (video vs. text, news vs. entertainment,
political news vs. sports news, etc.), classifying even a small
fraction of URLs according to content is an onerous task.
Bakshy et al. [1], for example, used Amazon’s Mechanical
Turk to classify a stratified sample of 1,000 URLs along a
variety of dimensions; however, this method does not scale
well to larger sample sizes.
Instead, we restrict attention to URLs originated by the
New York Times which, with over 2.5M followers, is the
second-most-followed news organization on Twitter, after
CNN Breaking News. The NY Times, however, is roughly ten
times as active as CNN Breaking News, so it is arguably a
better source of data. To classify NY Times content, we
exploit a convenient feature of their format, namely that
all NY Times URLs are classified in a consistent way by
the section in which they appear (e.g. U.S., World, Sports,
Science, Arts, etc.; see footnote 6). Of the 6398 New York Times bit.ly
URLs we observed, 6370 could be successfully unshortened
and assigned to one of 21 categories. Of these, however, only
9 categories had more than 100 URLs during the observa-
tion period, one of which—“NY region”—was highly specific
to the New York metropolitan area; thus we focused our
attention on the remaining 8 topical categories. Figure 7
shows the proportion of URLs from each New York Times
section retweeted or reintroduced by each category. World
news is the most popular category, followed by U.S. News,
Business, and Sports, where increasingly niche categories
like Health, Arts, Science, and Technology are less popular
still. In general, the overall pattern is replicated for all
categories of users, but there are some minor deviations: in
particular, organizations show disproportionately little
interest in business and arts-related stories, and
disproportionately high interest in science, technology, and possibly
world news. Celebrities, by contrast, show greater interest
in sports and less interest in health, while the media shows
somewhat greater interest in U.S. news stories.

Footnote 6:
http://www.nytimes.com/year/month/day/category/title.html?ref=category

Figure 7: Number of RTs and Reintroductions of
New York Times stories by content category. (Panels:
1. World News, 2. U.S. News, 3. Business, 4. Sports, 5. Health,
6. Technology, 7. Science, 8. Arts. x-axis: User Category
(blog, celeb, media, org, other); y-axis: % RTs and
Re-introductions, 0.00 to 0.35.)
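Assigning a story to a section from its URL can be sketched as below, assuming the unshortened URL follows the year/month/day/category/title pattern given in the footnote. The function name is hypothetical, and real URLs may carry sub-sections or other layouts that this sketch does not handle; it simply returns the first path segment after the date.

```python
from urllib.parse import urlparse


def nyt_section(url):
    """Return the section segment of an NY Times article URL, or None."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("nytimes.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expect year, month, day, then at least a section segment and a title.
    if len(parts) < 5 or not all(p.isdigit() for p in parts[:3]):
        return None
    return parts[3]
```

URLs that do not match the pattern return `None`, which loosely mirrors the small number of URLs in the text that could not be assigned to a category.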
5.1 Lifespan of Content
In addition to different types of content, URLs introduced
by different types of elite users or ordinary users may exhibit
different lifespans, by which we mean the time lag between
the first and last appearance of a given URL on Twitter.
Naively, measuring lifespan seems a trivial matter; how-
ever, a finite observation period—which results in censoring
of our data—complicates this task. In other words, a URL
that is last observed towards the end of the observation pe-
riod may be retweeted or reintroduced after the period ends,
while correspondingly, a URL that is first observed toward
the beginning of the observation window may in fact have
been introduced before the window began. What we ob-
serve as the lifespan of a URL, therefore, is in reality a
lower bound on the lifespan. Although this limitation does
not create much of a problem for short-lived URLs (which
account for the vast majority of our observations), it does
create large biases for long-lived URLs. In particular, URLs
that appear towards the end of our observation period will
be systematically classified as shorter-lived than URLs that
appear towards the beginning.

Figure 8: Schematic of lifespan estimation procedure.
(Labels: buffer δ; lifespan τ; estimation period = 133 days;
evaluation period = 90 days; total observation window = 223 days.)
To address the censoring problem, we seek to determine
a buffer δ at both the beginning and the end of our 223-
day period, and only count URLs as having a lifespan of τ
if (a) they do not appear in the first δ days, (b) they first
appear in the interval between the buffers, and (c) they do
not appear in the last δ days, as illustrated in Figure 8(a).
To determine δ we first split the 223-day period into two
segments, the first 133-day estimation period and the last
90-day evaluation period (see Figure 8(b)), and then ask: if
we (a) observe a URL first appear in the first 133 − δ days and
(b) do not see it in the δ days prior to the splitting point, how
likely are we to see it in the last 90 days? Clearly this depends
on the actual lifespan of the URL, as the longer a URL
lives, the more likely it will re-appear in the future. Using
this estimation/evaluation split, we find an upper bound on
lifespan for which we can determine the actual lifespan with
95% accuracy as a function of δ. Finally, because we require
a beginning and ending buffer, and because we can only
classify a URL as having lifespan τ if it appears at least τ
days before the end of our window, we need to pick τ and
δ such that τ + 2δ ≤ 223. We determined that τ = 70
and δ = 70 sufficiently satisfied our constraints; thus for
the following analysis, we consider only URLs that have a
lifespan τ ≤ 70.
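The buffer rule described above can be sketched as a simple filter. This is an illustrative reading of the procedure, not the study's code: the function name is hypothetical, days are assumed to be 0-indexed offsets from the start of the window, and the inequality between τ, δ, and the window length is taken to be non-strict.

```python
WINDOW_DAYS = 223  # total observation window, from the text
DELTA = 70         # buffer at each end of the window
TAU = 70           # longest lifespan retained for analysis


def uncensored_lifespan(first_day, last_day,
                        window=WINDOW_DAYS, delta=DELTA, tau=TAU):
    """Return the lifespan in days, or None if the URL should be excluded.

    first_day / last_day are the first and last days on which the URL was
    observed, counted from 0 at the start of the window (an assumption).
    """
    if first_day < delta:
        return None  # seen in the opening buffer: may predate the window
    if last_day >= window - delta:
        return None  # seen in the closing buffer: may outlive the window
    lifespan = last_day - first_day
    return lifespan if lifespan <= tau else None


# The chosen values respect the constraint tau + 2 * delta <= window.
assert TAU + 2 * DELTA <= WINDOW_DAYS
```

With δ = 70 the usable interval is 83 days long, so every lifespan up to τ = 70 days can be observed in full between the two buffers.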
5.2 Lifespan By Category
Having established a method for estimating URL lifes-
pan, we now explore the lifespan of URLs introduced by
different categories of users, as shown in Figure 9(a). URLs
initiated by the elite categories exhibit a similar distribu-
tion over lifespan to those initiated by ordinary users. As
Figure 9(b) shows, however, when looking at the percent-
age of URLs of different lifespans initiated by each category,
we see two additional results: first, media actors
originate a large portion of short-lived URLs (especially
URLs with lifespan = 0, those that appeared only
once); and second, URLs originated by bloggers are
overrepresented among the longer-lived content. Both of these
results can be explained by the type of content that origi-
nates from different sources: whereas news stories tend to
be replaced by updates on a daily or more frequent basis,
the sorts of URLs that are picked up by bloggers are of more
persistent interest, and so are more likely to be retweeted or
reintroduced months or even years after their initial intro-
duction.
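The percentage view, in which each category's share is computed separately for every lifespan value, can be sketched as follows. The input format and function name are assumptions for illustration; the category labels follow those used in the figures (blog, celeb, media, org, other).

```python
from collections import Counter, defaultdict


def category_share_by_lifespan(urls):
    """urls: iterable of (lifespan_days, originating_category) pairs.

    Returns {lifespan: {category: percent of URLs with that lifespan}}.
    """
    counts = defaultdict(Counter)
    for lifespan, category in urls:
        counts[lifespan][category] += 1
    shares = {}
    for lifespan, by_category in counts.items():
        total = sum(by_category.values())
        shares[lifespan] = {
            category: 100.0 * k / total for category, k in by_category.items()
        }
    return shares
```

Normalizing within each lifespan is what lets over- or under-representation show up even when the raw counts for every category fall off in a similar way.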
Figure 9: (a) Count and (b) percentage of URLs
initiated by 5 categories, with different lifespans.
(Panel (a): x-axis lifespan (day); y-axis log(# of URLs with
lifespan = x day); series other, celeb, media, org, blog.
Panel (b): x-axis lifespan (day); y-axis % of URLs from
elites category; series celeb, media, org, blog.)
To shed more light on the nature of long-lived content on
Twitter, we used the bit.ly API service to unshorten 35K
of the most long-lived URLs (URLs that lived at least 200
days), and mapped them into 21034 web domains. As Figure
10 shows, the population of long-lived URLs is dominated by
videos, music, and books. Twitter, in other words, should
be viewed as a subset of a much larger media ecosystem in
which content exists and is repeatedly rediscovered by Twit-
ter users. Some of this content—such as daily news stories—
has a relatively short period of relevance, after which a given
story is unlikely to be reintroduced or rebroadcast. At the
other extreme, classic music videos, movie clips, and long-
format magazine articles have lifespans that are effectively
unbounded, and can seemingly be rediscovered by Twitter
users indefinitely without losing relevance.
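Mapping unshortened URLs to web domains and ranking them can be sketched as below. The sketch assumes the URLs have already been unshortened (the bit.ly API call itself is not shown), and the function name and the choice to strip a leading "www." are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(long_urls, k=20):
    """Count unshortened URLs per web domain and return the k most common."""
    domains = Counter()
    for url in long_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com alike
        if host:
            domains[host] += 1
    return domains.most_common(k)
```

Calling `top_domains(urls, k=20)` on the long-lived URLs would give the kind of ranking summarized in a top-20 domain chart.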
Two related points are illustrated by Figure 11, which
shows the average RT rate (the proportion of tweets con-
taining the URL that are retweets of another tweet) of URLs
with different lifespans, grouped by the categories that
introduced the URL (see footnote 7). First, for ordinary users,
the majority of appearances of URLs after the initial
introduction derives not from retweeting, but rather from
reintroduction, where this result is especially pronounced
for long-lived URLs. For the vast majority of URLs on
Twitter, in other words, longevity is determined not by
diffusion, but by many different users independently
rediscovering the same content, consistent with our
interpretation above. Second, however, for URLs introduced
by elite users, the result is somewhat the opposite: that is,
they are more likely to be retweeted than reintroduced, even
for URLs that persist for weeks. Although it is unsurprising
that elite users generate more retweets than ordinary users,
the size of the difference is nevertheless striking, and
suggests that in spite of the dominant result above that
content lifespan is determined to a large extent by type, the
source of its origin also impacts its persistence, at least on
average, a result that is consistent with previous findings [1].

Footnote 7: Note here that URLs with lifespan = 0 are those
URLs that only appeared once in our dataset, thus the RT
rate is zero.

Figure 10: Top 20 domains for URLs that lived more
than 200 days

Figure 11: Average RT rate by lifespan for each of
the originating categories. (x-axis: lifespan (day); y-axis:
RT rate (# of RTs / total # of occurrences); series: other,
celeb, media, org, blog.)

6. CONCLUSIONS
In this paper, we investigated a classic problem in media
communications research, captured by the first part of
Lasswell’s maxim, “who says what to whom”, in the context
of Twitter. By restricting our attention to Twitter, our
conclusions are necessarily limited to one narrow cross-section of
the media landscape. Moreover, communications on Twitter
may be unrepresentative of information flow via more tradi-
tional channels, such as TV and radio on the one hand, and
interpersonal interactions on the other hand. However, we
feel the advantages of using Twitter to answer this question
outweighed the limitations. First, because Twitter users ex-
plicitly opt-in to “follow” each other, and because Twitter
maintains a complete record of every tweet broadcast, it
provides an unprecedented level of resolution and coverage
regarding who is listening to whom. Second, because Twit-
ter users themselves classify other users by including them
on lists, Twitter effectively provides a ready-made, crowd-
sourced classification scheme of users.
By studying the flow of information among the five cat-
egories that we identified (media, celebrities, organizations,
bloggers, and ordinary), our analysis sheds new light on
some old questions of communications research. First, we
find that although audience attention has indeed fragmented
among a wider pool of content producers than classical mod-
els of mass media, attention remains highly concentrated,
where roughly 0.05% of the population accounts for almost
half of all attention. Within the population of elite users,
moreover, attention is highly homophilous, with celebrities
following celebrities, media following media, and bloggers
following bloggers. Second, we find considerable support for
the two-step flow of information—almost half the informa-
tion that originates from the media passes to the masses indi-
rectly via a diffuse intermediate layer of opinion leaders, who
although classified as ordinary users, are more connected
and more exposed to the media than their followers. Third,
we find that although all categories devote a roughly simi-
lar fraction of their attention to different categories of news
(World, U.S., Business, etc), there are some differences—
organizations, for example, devote a surprisingly small frac-
tion of their attention to business-related news. We also find
that different types of content exhibit very different lifes-
pans. In particular, media-originated URLs are dispropor-
tionately represented among short-lived URLs while those
originated by bloggers tend to be overrepresented among
long-lived URLs. Finally, we find that the longest-lived
URLs are dominated by content such as videos and music,
which are continually being rediscovered by Twitter users
and appear to persist indefinitely.
In closing, we note that although our use of Twitter lists
to label users was motivated by a specific set of questions
regarding mass vs interpersonal communications, and that
for this reason we have focused on a limited set of predeter-
mined user-categories, it would also be interesting to explore
automatic classification schemes from which additional user
categories could emerge. In particular, such an approach
would allow one to examine the category of opinion lead-
ers in more detail, possibly identifying opinion leaders for
different topics, as has been proposed elsewhere [14]. In
addition, another area for future work would be to extract
content information in a more systematic manner, shedding
more light on the “what” element of Lasswell’s maxim. And
finally, a significant challenge for future work is to merge
the data regarding information flow on Twitter with other
sources of outcome data—relating, for example, to opinions
or actions that would engage more directly with the “effects”
component of Lasswell’s maxim.
7. REFERENCES
[1] E. Bakshy, J. M. Hofman, W. A. Mason, and
D. J. Watts. Identifying ‘influencers’ on twitter. In
Fourth ACM International Conference on Web Search
and Data Mining (WSDM), Hong Kong, 2011. ACM.
[2] W. L. Bennett and S. Iyengar. A new era of minimal
effects? the changing foundations of political
communication. Journal of Communication,
58(4):707–731, 2008.
[3] M. Cha, H. Haddadi, F. Benevenuto, and K. P.
Gummadi. Measuring user influence on twitter: The
million follower fallacy. In 4th Int’l AAAI Conference
on Weblogs and Social Media, Washington, DC, 2010.
[4] J. S. Coleman, E. Katz, and H. Menzel. The diffusion
of an innovation among physicians. Sociometry,
20(4):253–270, 1957.
[5] T. Gitlin. Media sociology: The dominant paradigm.
Theory and Society, 6(2):205–253, 1978.
[6] E. Katz and P. F. Lazarsfeld. Personal influence; the
part played by people in the flow of mass
communications. Free Press, Glencoe, Ill., 1955.
[7] G. Kossinets and D. J. Watts. Empirical analysis of an
evolving social network. Science, 311(5757):88–90,
2006.
[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is
twitter, a social network or a news media? In
Proceedings of the 19th international conference on
World Wide Web, pages 591–600. ACM, 2010.
[9] H. D. Lasswell. The structure and function of
communication in society. In L. Bryson, editor, The
Communication of Ideas, pages 117–130. University of
Illinois Press, Urbana, IL, 1948.
[10] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. The
people’s choice; how the voter makes up his mind in a
presidential campaign. Columbia University Press,
New York, 3rd edition, 1968.
[11] R. K. Merton. Patterns of influence: Local and
cosmopolitan influentials. In R. K. Merton, editor,
Social theory and social structure, pages 441–474. Free
Press, New York, 1968.
[12] J. Surowiecki. The Wisdom of Crowds : Why the many
are smarter than the few and how collective wisdom
shapes business, economies, societies, and nations.
Doubleday, New York, 1st edition, 2004.
[13] J. B. Walther, C. T. Carr, S. S. W. Choi, D. C.
DeAndrea, J. Kim, S. T. Tong, and B. Van Der Heide.
Interaction of interpersonal, peer, and media influence
sources online. In Z. Papacharissi, editor, A Networked
Self: Identity, Community, and Culture on Social
Network Sites, pages 17–38. Routledge, 2010.
[14] G. Weimann. The Influentials: People Who Influence
People. State University of New York Press, Albany,
NY, 1994.
[15] J. Weng, E. P. Lim, J. Jiang, and Q. He. Twitterrank:
finding topic-sensitive influential twitterers. In
Proceedings of the third ACM international conference
on Web search and data mining, pages 261–270. ACM,
2010.