Who Says What to Whom on Twitter

Shaomei Wu∗

Cornell University, USA

sw475@cornell.edu

Jake M. Hofman

Yahoo! Research, NY, USA

hofman@yahoo-inc.com

Winter A. Mason

Yahoo! Research, NY, USA

winteram@yahoo-inc.com

Duncan J. Watts

Yahoo! Research, NY, USA

djw@yahoo-inc.com

ABSTRACT

We study several longstanding questions in media communi-

cations research, in the context of the microblogging service

Twitter, regarding the production, ﬂow, and consumption

of information. To do so, we exploit a recently introduced

feature of Twitter—known as Twitter lists—to distinguish

between elite users, by which we mean speciﬁcally celebri-

ties, bloggers, and representatives of media outlets and other

formal organizations, and ordinary users. Based on this clas-

siﬁcation, we ﬁnd a striking concentration of attention on

Twitter—roughly 50% of tweets consumed are generated by

just 20K elite users—where the media produces the most in-

formation, but celebrities are the most followed. We also ﬁnd

signiﬁcant homophily within categories: celebrities listen to

celebrities, while bloggers listen to bloggers etc; however,

bloggers in general rebroadcast more information than the

other categories. Next we re-examine the classical “two-step

ﬂow” theory of communications, ﬁnding considerable sup-

port for it on Twitter, but also some interesting diﬀerences.

Third, we ﬁnd that URLs broadcast by diﬀerent categories

of users or containing diﬀerent types of content exhibit sys-

tematically diﬀerent lifespans. And ﬁnally, we examine the

attention paid by the diﬀerent user categories to diﬀerent

news topics.

Categories and Subject Descriptors

H.1.2 [Models and Principles]: User/Machine Systems;

J.4 [Social and Behavioral Sciences]: Sociology

General Terms

two-step ﬂow, communications, classiﬁcation

Keywords

Communication networks, Twitter, information ﬂow

∗

Part of this research was performed while the author was

visiting Yahoo! Research, New York.


WWW ’11 Hyderabad, India

1. INTRODUCTION

A longstanding objective of media communications re-

search is encapsulated by what is known as Lasswell’s maxim:

“who says what to whom in what channel with what ef-

fect” [9], so-named for one of the pioneers of the ﬁeld, Harold

Lasswell. Although simple to state, Lasswell’s maxim has

proven diﬃcult to satisfy in the more-than 60 years since

he stated it, in part because it is generally diﬃcult to ob-

serve information ﬂows in large populations, and in part

because diﬀerent channels have very diﬀerent attributes and

eﬀects. As a result, theories of communications have tended

to focus either on “mass” communication, deﬁned as “one-

way message transmissions from one source to a large, rela-

tively undiﬀerentiated and anonymous audience,” or on “in-

terpersonal” communication, meaning a “two-way message

exchange between two or more individuals.” [13].

Correspondingly, debates among communication theorists

have tended to revolve around the relative importance of

these two putative modes of communication. For example,

whereas early theories such as the so-called “hypodermic

model” posited that mass media exerted direct and relatively

strong eﬀects on public opinion, mid-century researchers [10,

6, 11, 4] argued that the mass media inﬂuenced the pub-

lic only indirectly, via what they called a two-step ﬂow of

communications, where the critical intermediate layer was

occupied by a category of media-savvy individuals called

opinion leaders. The resulting “limited eﬀects” paradigm

was then subsequently challenged by a new generation of

researchers [5], who claimed that the real importance of the

mass media lay in its ability to set the agenda of public

discourse. But in recent years rising public skepticism of

mass media, along with changes in media and communica-

tion technology, have tilted conventional academic wisdom

once more in favor of interpersonal communication, which

some identify as a “new era” of minimal eﬀects [2].

Recent changes in technology, however, have increasingly

undermined the validity of the mass vs. interpersonal di-

chotomy itself. On the one hand, over the past few decades

mass communication has experienced a proliferation of new

channels, including cable television, satellite radio, special-

ist book and magazine publishers, and of course an array of

web-based media such as sponsored blogs, online communi-

ties, and social news sites. Correspondingly, the traditional

mass audience once associated with, say, network television

has fragmented into many smaller audiences, each of which

increasingly selects the information to which it is exposed,

and in some cases generates the information itself. Mean-

while, in the opposite direction interpersonal communication

has become increasingly ampliﬁed through personal blogs,

email lists, and social networking sites to aﬀord individu-

als ever-larger audiences. Together, these two trends have

greatly obscured the historical distinction between mass and

interpersonal communications, leading some scholars to refer

instead to “masspersonal” communications [13].

Nowhere is the erosion of traditional categories more ap-

parent than in the micro-blogging platform Twitter. To il-

lustrate, the top ten most-followed users on Twitter are not

corporations or media organizations, but individual people,

mostly celebrities. Moreover, these individuals communicate

directly with their millions of followers through accounts that are often managed

by themselves or their publicists, thus bypassing the traditional

intermediation of the mass media between celebrities and

fans. Next, in addition to conventional celebrities, a new

class of “semi-public” individuals like bloggers, authors, journalists,

and subject matter experts has come to occupy an

important niche on Twitter, in some cases becoming more

prominent than traditional public ﬁgures such as entertain-

ers and elected oﬃcials. Third, in spite of these shifts away

from centralized media power, media organizations—along

with corporations, governments, and NGOs—all remain well

represented among highly followed users, and are often ex-

tremely active. And ﬁnally, Twitter is primarily made up

of many millions of users who seem to be ordinary individ-

uals communicating with their friends and acquaintances in

a manner largely consistent with traditional notions of in-

terpersonal communication.

Twitter, therefore, represents the full spectrum of communications

from personal and private to “masspersonal” to traditional

mass media. Consequently it provides an interesting

context in which to address Lasswell’s maxim, especially as

Twitter—unlike television, radio, and print media—enables

one to easily observe information ﬂows among the members

of its ecosystem. Unfortunately, however, the kinds of ef-

fects that are of most interest to communications theorists,

such as changes in behavior, attitudes, etc., remain diﬃcult

to measure on Twitter. Therefore in this paper we limit

our focus to the “who says what to whom” part of Lasswell’s

maxim.

To this end, our paper makes three main contributions:

• We introduce a method for classifying users using Twit-

ter Lists into “elite” and “ordinary” users, further clas-

sifying elite users into one of four categories of interest—

media, celebrities, organizations, and bloggers.

• We investigate the ﬂow of information among these

categories, ﬁnding that although audience attention is

highly concentrated on a minority of elite users, much

of the information they produce reaches the masses

indirectly via a large population of intermediaries.

• We ﬁnd that diﬀerent categories of users place slightly

diﬀerent emphasis on diﬀerent types of content, and

that diﬀerent content types exhibit dramatically dif-

ferent characteristic lifespans, ranging from less than

a day to months.

The remainder of the paper proceeds as follows. In the

next section, we review related work. In section 3 we dis-

cuss our data and methods, including section 3.3 in which

we describe how we use Twitter Lists to classify users, out-

line two diﬀerent sampling methods, and show that they

deliver qualitatively similar results. In section 4 we analyze

the production of information on Twitter, particularly who

pays attention to whom. In section 4.1, we revisit the the-

ory of the two-step ﬂow—arguably the dominant theory of

communications for much of the past 50 years—ﬁnding con-

siderable support for the theory as well as some interesting

diﬀerences. In section 5, we consider “who listens to what”,

examining ﬁrst who shares what kinds of media content, and

second the lifespan of URLs as a function of their origin and

their content. Finally, in section 6 we conclude with a brief

discussion of future work.

2. RELATED WORK

Aside from the communications literature surveyed above,

a number of recent papers have examined information dif-

fusion on Twitter. Kwak et al. [8] studied the topological

features of the Twitter follower graph, concluding from the

highly skewed nature of the distribution of followers and the

low rate of reciprocated ties that Twitter more closely resem-

bled an information sharing network than a social network—

a conclusion that is consistent with our own view. In ad-

dition, Kwak et al. compared three diﬀerent measures of

inﬂuence—number of followers, page-rank, and number of

retweets—ﬁnding that the ranking of the most inﬂuential

users diﬀered depending on the measure. In a similar vein,

Cha et al. [3] compared three measures of inﬂuence—number

of followers, number of retweets, and number of mentions—

and also found that the most followed users did not neces-

sarily score highest on the other measures. Weng et al. [15]

compared number of followers and page rank with a modiﬁed

page-rank measure which accounted for topic, again ﬁnding

that ranking depended on the inﬂuence measure. Finally,

Bakshy et al. [1] studied the distribution of retweet cascades

on Twitter, ﬁnding that although users with large follower

counts and past success in triggering cascades were on aver-

age more likely to trigger large cascades in the future, these

features are in general poor predictors of future cascade size.

Our paper diﬀers from this earlier work by shifting atten-

tion from the ranking of individual users in terms of various

inﬂuence measures to the ﬂow of information among dif-

ferent categories of users. In particular, we are interested

in identifying “elite” users, who we diﬀerentiate from “ordi-

nary” users in terms of their visibility, and understanding

their role in introducing information into Twitter, as well as

how information originating from traditional media sources

reaches the masses.

3. DATA AND METHODS

3.1 Twitter Follower Graph

In order to understand how information is transmitted on

Twitter, we need to know the channels by which it ﬂows;

that is, who is following whom on Twitter. To this end, we

used the follower graph studied by Kwak et al. [8], which

included 42M users and 1.5B edges. This data represents

a crawl of the graph seeded with all users on Twitter as

observed by July 31st, 2009, and is publicly available (the data

is free to download from http://an.kaist.ac.kr/traces/WWW2010.html). As

reported by Kwak et al. [8], the follower graph is a directed

network characterized by highly skewed distributions both

of in-degree (# followers) and out-degree (# “friends”, Twitter

notation for how many others a user follows); however,

the in-degree distribution is even more skewed than the

out-degree distribution. In both friend and follower distributions,

for example, the median is less than 100, but the maximum

# friends is several hundred thousand, while a small

number of users have millions of followers. In addition, the

follower graph is also characterized by extremely low reci-

procity (roughly 20%)—in particular, the most-followed in-

dividuals typically do not follow many others. The Twitter

follower graph, in other words, does not conform to the usual

characteristics of social networks, which exhibit much higher

reciprocity and far less skewed degree distributions [7], but

instead resembles more the mixture of one-way mass com-

munications and reciprocated interpersonal communications

described above.

3.2 Twitter Firehose

In addition to the follower graph, we are interested in the

content being shared on Twitter—particularly URLs—and

so we examined the corpus of all 5B tweets generated over

a 223 day period from July 28, 2009 to March 8, 2010 using

data from the Twitter “firehose,” the complete stream

of all tweets. Because our objective is to understand the

ﬂow of information, it is useful for us to restrict attention to

tweets containing URLs, for two reasons. First, URLs add

easily identiﬁable tags to individual tweets, allowing us to

observe when a particular piece of content is either retweeted

or subsequently reintroduced by another user. And second,

because URLs point to online content outside of Twitter,

they provide a much richer source of variation than is pos-

sible in the typical 140 character tweet. Finally, we note

that almost all URLs broadcast on Twitter have been short-

ened using one of a number of URL shorteners, of which the

most popular is http://bit.ly/. From the total of 5B tweets

recorded during our observation period, therefore, we focus

our attention on the subset of 260M containing bit.ly URLs.
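The restriction to tweets containing bit.ly URLs can be illustrated with a minimal sketch. The regular expression below is an assumption about what a bit.ly short link looks like, and the function names are hypothetical; this is not the pipeline used in the paper.

```python
import re

# Assumption: a bit.ly short link has the form http(s)://bit.ly/<hash>,
# where <hash> consists of letters, digits, underscores, or hyphens.
BITLY_PATTERN = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")


def extract_bitly_urls(tweet_text):
    """Return all bit.ly URLs found in a tweet, in order of appearance."""
    return BITLY_PATTERN.findall(tweet_text)


def filter_bitly_tweets(tweets):
    """Keep (text, urls) pairs for tweets that contain at least one bit.ly URL."""
    kept = []
    for text in tweets:
        urls = extract_bitly_urls(text)
        if urls:
            kept.append((text, urls))
    return kept
```

Treating the shortened URL as the tag for a piece of content is what allows later occurrences of the same URL to be matched to one another.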

3.3 Twitter Lists

Our method for classifying users exploits a relatively re-

cent feature of Twitter: Twitter Lists. Since its launch on

November 2, 2009, Twitter Lists have been welcomed by the

community as a way to group people and organize one’s in-

coming stream of tweets by speciﬁc sets of users. To create

a Twitter List, a user needs to provide a name (required)

and description (optional) for the list, and decide whether

the new list is public (anyone can view and subscribe to this

list) or private (only the list creator can view or subscribe to

this list). Once a list is created, the user can add/edit/delete

list members. As the purpose of Twitter Lists is to help users

organize users they follow, the name of the list can be con-

sidered a meaningful label for the listed users. List creation

therefore effectively applies the “wisdom of crowds” [12]

to the task of classifying users, both in terms of their im-

portance to the community (number of lists on which they

appear), and also how they are perceived (e.g. news organi-

zation vs. celebrity, etc.).

Twitter firehose documentation (footnote to Section 3.2): http://dev.twitter.com/doc/get/statuses/firehose

Before describing our methods for classifying users in terms

of the lists on which they appear, we emphasize that we

are motivated by a particular set of substantive questions

arising out of communications theory. In particular, we

are interested in the relative importance of mass communications,

as practiced by media and other formal organizations,

masspersonal communications as practiced by celebrities

and prominent bloggers, and interpersonal communications,

as practiced by ordinary individuals communicating

with their friends. In addition, we are also interested in the

relationships between these categories of users, motivated

by theoretical arguments such as the theory of the two-step

ﬂow [6]. Rather than pursuing a strategy of automatic clas-

siﬁcation, therefore, our approach depends on deﬁning and

identifying certain predetermined classes of theoretical in-

terest, where both approaches have advantages and disad-

vantages. In particular, we restrict our attention to four

classes of what we call “elite” users: media, celebrities, orga-

nizations, and bloggers, as well as the relationships between

these elite users and the much larger population of “ordi-

nary” users.

In addition to these theoretically-imposed constraints,

our proposed classiﬁcation method must also satisfy a prac-

tical constraint—namely that the rate limits established by

Twitter’s API eﬀectively preclude crawling all lists for all

Twitter users. Thus we instead devised two different sampling

schemes (a snowball sample and an activity sample),

each with some advantages and disadvantages, discussed below.
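The scale of the rate-limit constraint can be checked with a short calculation. The sketch below uses the figures quoted in the paper's footnote on the API (20K calls per hour, roughly 40M users) and assumes, optimistically, that a single call per user suffices; the function name is hypothetical.

```python
# Figures quoted in the text: 20K API calls per hour and roughly 40M users.
# Assumption: one call per user is enough (each user is on at most 20 lists).
CALLS_PER_HOUR = 20_000
NUM_USERS = 40_000_000


def crawl_time(num_users=NUM_USERS, calls_per_hour=CALLS_PER_HOUR, calls_per_user=1):
    """Return (hours, days, weeks) needed to crawl the lists of every user."""
    hours = num_users * calls_per_user / calls_per_hour
    days = hours / 24
    weeks = days / 7
    return hours, days, weeks


hours, days, weeks = crawl_time()
print(f"{hours:.0f} hours = {days:.1f} days = {weeks:.1f} weeks")
# prints "2000 hours = 83.3 days = 11.9 weeks"
```

Raising `calls_per_user` shows how quickly the estimate grows once users appear on more than 20 lists.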

3.3.1 Snowball sample of Twitter Lists

The first method for identifying elite users employed snowball

sampling. For each category, we chose a number $u_0$ of

seed users that were highly representative of the desired category

and appeared on many category-related lists. For each

of the four categories above, the following seeds were chosen:

• Celebrities: Barack Obama, Lady Gaga, Paris Hilton

• Media: CNN, New York Times

• Organizations: Amnesty International, World Wildlife

Foundation, Yahoo! Inc., Whole Foods

• Blogs: BoingBoing, FamousBloggers, problogger, mashable,

Chrisbrogan, virtuosoblogger, Gizmodo, Ileane,

dragonblogger, bbrian017, hishaman, copyblogger, en-

gadget, danielscocco, BlazingMinds, bloggersblog, Ty-

coonBlogger, shoemoney, wchingya, extremejohn,

GrowMap, kikolani, smartbloggerz, Element321, bran-

donacox, remarkablogger, jsinkeywest, seosmarty, No-

tAProBlog, kbloemendaal, JimiJones, ditesco

The Twitter API allows only 20K calls per hour, where at

most 20 lists can be retrieved for each API call. Under the

modest assumption of 40M users (roughly the number in the

2009 crawl by [8]), where each user is included on at most

20 lists, this would require $4 \times 10^{7} / (2 \times 10^{4}) = 2{,}000$ hours, or

roughly 12 weeks. Clearly this time could be reduced by deploying

multiple accounts, but it also likely underestimates the real

time quite significantly, as many users appear on many more

than 20 lists (e.g. Lady Gaga appears on nearly 140,000).

The blogger category required many more seeds because

bloggers are in general lower profile than the seeds for the

other categories.

Figure 1: Schematic of the Snowball Sampling Method

After reviewing the lists associated with these seeds, the

following keywords were hand-selected based on (a) their

representativeness of the desired categories; and (b) their

lack of overlap between categories:

• Celebrities: star, stars, hollywood, celebs, celebrity,

celebrities, celebsverified, celebrity-list, celebrities-on-twitter,

celebrity-tweets

• Media: news, media, news-media

• Organizations: company, companies, organization,

organisation, organizations, organisations, corporation,

brands, products, charity, charities, causes, cause, ngo

• Blogs: blog, blogs, blogger, bloggers
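As an illustration of how these keywords might be used to prune a seed's lists, the sketch below keeps only list names that match a keyword for the target category. The keywords are copied from the lists above; the matching rule (exact, case-insensitive match on the list name) and the function names are assumptions, since the paper does not spell out the matching procedure.

```python
# Keywords copied from the text; the matching rule below is an assumption.
CATEGORY_KEYWORDS = {
    "celebrity": {"star", "stars", "hollywood", "celebs", "celebrity",
                  "celebrities", "celebsverified", "celebrity-list",
                  "celebrities-on-twitter", "celebrity-tweets"},
    "media": {"news", "media", "news-media"},
    "organization": {"company", "companies", "organization", "organisation",
                     "organizations", "organisations", "corporation", "brands",
                     "products", "charity", "charities", "causes", "cause", "ngo"},
    "blog": {"blog", "blogs", "blogger", "bloggers"},
}


def matching_categories(list_name):
    """Return the categories whose keyword set contains the normalized list name."""
    name = list_name.strip().lower()
    return sorted(c for c, kws in CATEGORY_KEYWORDS.items() if name in kws)


def prune_lists(list_names, category):
    """Keep only the list names that match a keyword of the given category."""
    keywords = CATEGORY_KEYWORDS[category]
    return [n for n in list_names if n.strip().lower() in keywords]
```

For a seed that appears on lists named “faves”, “celebs”, and “celebrity”, `prune_lists(["faves", "celebs", "celebrity"], "celebrity")` keeps only the latter two.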

Having selected the seeds and the keywords for each cate-

gory, we then performed a snowball sample of the bipartite

graph of users and lists (see Figure 1). For each seed, we

crawled all lists on which that seed appeared. The resulting

“list of lists” was then pruned to contain only the $l_0$ lists

whose names matched at least one of the chosen keywords

for that category. For instance, Lady Gaga is on lists called

“faves”, “celebs”, and “celebrity”, but only the latter two lists

would be kept after pruning. We then crawled all $u_1$ users

appearing in the pruned “list of lists” (for instance, ﬁnd-

ing all users that appeared in the “celebrity” list with Lady

Gaga), and then repeated these last two steps to complete

the crawl. In total, 524,116 users were obtained, who appeared

on 7,000,000 lists; however, many of the more prominent

users appeared on lists in more than one category: for

example Oprah Winfrey is frequently included in lists of

“celebrity” as well as “media.” To resolve this ambiguity, we

computed a user i’s membership score in category c:

$$w_{ic} = \frac{n_{ic}}{N_c},$$

where $n_{ic}$ is the number of lists in category c that contain

user i and $N_c$ is the total number of lists in category c.

We then assigned each user to the category in which he

or she has the highest membership score. The number of

users assigned in this manner to each category is reported

in Table 1.
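A minimal sketch of this assignment step follows, taking the membership score to be the number of category-c lists containing the user divided by the total number of category-c lists, as the definitions above suggest. The input layout, the function names, and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import defaultdict


def membership_scores(lists_by_category):
    """Compute per-user membership scores.

    lists_by_category maps a category name to a list of member sets,
    one set per Twitter List. Returns {user: {category: score}}.
    """
    scores = defaultdict(dict)
    for category, lists in lists_by_category.items():
        total_lists = len(lists)        # total number of lists in the category
        counts = defaultdict(int)       # number of lists containing each user
        for members in lists:
            for user in set(members):
                counts[user] += 1
        for user, n in counts.items():
            scores[user][category] = n / total_lists
    return dict(scores)


def assign_categories(scores):
    """Assign each user to the category with the highest membership score.

    Assumption: ties are broken alphabetically by category name.
    """
    return {
        user: max(sorted(cats), key=lambda c: cats[c])
        for user, cats in scores.items()
    }
```

A user who appears on three of four “celebrity” lists and one of two “media” lists scores 0.75 and 0.5 respectively, and is therefore assigned to the celebrity category.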

3.3.2 Activity Sample of Twitter Lists

Although the snowball sampling method is convenient and

is easily interpretable with respect to our theoretical moti-

vation, it is also potentially biased by our particular choice

of seeds. To address this concern, we also generate a sample

of users based on their activity. Speciﬁcally, we crawl all

lists associated with all users who tweet at least once every

week for our entire observation period.
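The activity criterion can be sketched as follows. We assume here that “every week” means every consecutive 7-day window counted from the start of the observation period, with a final partial window also requiring a tweet; the paper does not specify this, and the function name is hypothetical.

```python
import math
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def active_every_week(timestamps, start, end):
    """Return True if at least one timestamp falls in every 7-day window of [start, end).

    Assumption: windows are counted from `start`; the last one may be partial.
    """
    num_windows = math.ceil((end - start) / WEEK)
    seen = {(ts - start) // WEEK for ts in timestamps if start <= ts < end}
    return len(seen) == num_windows
```

Users passing this check would then have all of their lists crawled; users with even one silent window are excluded, which is the source of the bias toward consistently active accounts.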

Table 1: Distribution of users over categories

            Snowball Sample          Activity Sample
category    # of users  % of users   # of users  % of users
celeb           82,770       15.8%       14,778       13.0%
media          216,010       41.2%       40,186       35.3%
org             97,853       18.7%       14,891       13.1%
blog           127,483       24.3%       43,830       38.6%
total          524,116        100%      113,685        100%

This “activity-based” sampling method is also clearly biased

towards users who are consistently active. Importantly,

however, the bias is likely to be quite different from any introduced

by the snowball sample; thus obtaining similar results

from the two samples should give us confidence that our

findings are not artifacts of the sampling procedure. This

method initially yielded 750k users and 5M lists; however,

after pruning the lists to those that contained at least one of the

keywords above, and assigning users to unique categories

(as described above), we obtained a much-reduced sample

of 113,685 users, where Table 1 reports the number of users

assigned to each category. We note that the number of lists

obtained by the activity sampling method is considerably

smaller than that obtained by the snowball sample, and

that bloggers are more heavily represented among the ac-

tivity sample at the expense of the other three categories—

consistent with our claim that the two methods introduce

diﬀerent biases. Interestingly, however, 97,614 of the ac-

tivity sample, or 85%, also appear in the snowball sample,

suggesting that the two sampling methods identify similar

populations of elite users, as indeed we confirm in the

next section.

3.3.3 Classifying Elite Users

In order to identify categories of elite users, we not only

need to classify users into categories, but also arrive at a def-

inition of “elite” that satisﬁes a tradeoﬀ between (a) keeping

each category relatively small, so as not to include users who

are not distinguishable from ordinary users, while (b) maxi-

mizing the volume of attention that is accounted for by each

category. In addition, it is also desirable to make the four

categories the same size, so as to facilitate comparisons. To

this end, we first rank all users in each category by how

frequently they are listed in that category. Next, we mea-

sure the ﬂow of information from the top k users in each

of the four categories to a random sample of 100K ordinary

(i.e. unclassiﬁed) users in two ways: the proportion of peo-

ple the user follows in each category, and the proportion of

tweets the user received from everyone the user follows in

each category.
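The two attention measures can be sketched for a toy follower graph as below. The data layout and function names are hypothetical, and we assume that the tweets a user receives from a friend equal the tweets that friend posted during the observation period.

```python
def attention_shares(friends, tweet_counts, elite_top_k):
    """Return (share of friends, share of tweets received) due to an elite set.

    friends: iterable of user ids that one ordinary user follows
    tweet_counts: dict mapping user id -> number of tweets posted
    elite_top_k: set of user ids (the top k users of one category)
    """
    friends = list(friends)
    if not friends:
        return 0.0, 0.0
    elite_friends = [f for f in friends if f in elite_top_k]
    friend_share = len(elite_friends) / len(friends)
    total_tweets = sum(tweet_counts.get(f, 0) for f in friends)
    elite_tweets = sum(tweet_counts.get(f, 0) for f in elite_friends)
    tweet_share = elite_tweets / total_tweets if total_tweets else 0.0
    return friend_share, tweet_share


def average_shares(sample, tweet_counts, ranked_category, k):
    """Average both shares over a sample {user: friends} for the top k of a ranking."""
    top_k = set(ranked_category[:k])
    shares = [attention_shares(fr, tweet_counts, top_k) for fr in sample.values()]
    n = len(shares)
    return (sum(s[0] for s in shares) / n, sum(s[1] for s in shares) / n)
```

Evaluating `average_shares` for increasing values of k and for each category in turn gives one curve per measure, which is the kind of comparison the choice of cut-off relies on.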

Figure 2(a) shows for the snowball sample the share of

following links (square symbols) and tweets received (dia-

monds) by an average user, while Figure 2(b) shows the

same information for the activity sample. Although the nu-

merical values diﬀer slightly, the two sets of results are qual-

itatively similar. In particular, for both sampling methods,

celebrities outrank all other categories, followed by the me-

dia, organizations, and bloggers. Also in both cases, the

bulk of the attention is accounted for by a relatively small

number of users within each category, as evidenced by the

relatively ﬂat slope of the attention curves, where we note

that the curve for celebrities asymptotes more slowly than

for the other three categories. Balancing the requirements

described above, therefore, we chose k = 5000 as a cut-oﬀ

for the elite categories, where all remaining users are hence-

forth classiﬁed as ordinary. In addition, from this point on,

we restrict our analysis of elite categories to the top 5,000

users identified by the snowball sampling method, noting that both

methods generate similar results.

[Figure 2 plots: for each category (celebrities, media, organizations, blogs), average % (0 to 30) versus top k (1,000 to 10,000), with one curve for friends and one for tweets received. Panel (a): Snowball sample. Panel (b): Activity sample.]

Figure 2: Average fraction of # following (blue line) and # tweets (red line) for a random user that are accounted for by the top k elite users crawled

Based on this deﬁnition of elite users, Table 2 shows that

although ordinary users collectively introduce by far the

highest number of URLs, members of the elite categories are

far more active on a per-capita basis. In particular, users

classiﬁed as “media” easily outproduce all other categories,

followed by bloggers, organizations, and celebrities. Ordi-

nary users originate on average only about 6 URLs each,

compared with over 1,000 for media users. In the rest of

this paper, therefore, when we talk about “celebrity”, “media”,

“organization”, “blog”, we refer to the top 5K users drawn

from the snowball sample listed as “celebrity”, “media”, “or-

ganization”, “blog”, respectively.
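One way to count the URLs initiated by each category is sketched below. We assume that a URL is credited to the first user who posts it and that users outside the elite categories count as “ordinary”; the paper does not state its exact counting rule, and the function names are hypothetical.

```python
from collections import Counter


def urls_initiated_by_category(tweets, user_category):
    """Count URLs per category, crediting each URL to its earliest poster.

    tweets: iterable of (timestamp, user, url) tuples
    user_category: dict mapping elite users to their category;
                   all other users are treated as "ordinary"
    """
    first_poster = {}
    for _, user, url in sorted(tweets):
        first_poster.setdefault(url, user)  # keep only the earliest poster
    return Counter(user_category.get(u, "ordinary") for u in first_poster.values())


def per_capita(counts, category_sizes):
    """Divide each category's URL count by the number of users in that category."""
    return {c: counts[c] / category_sizes[c] for c in counts}
```

Applied to the totals in Table 2, the per-capita step amounts to dividing each elite category's count by 5,000, for example 5,119,739 / 5,000 for media, which is roughly 1,024 URLs per user.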

Table 2: # of URLs initiated by category

category     # of URLs   # of URLs per-capita
celeb          139,058                  27.81
media        5,119,739                1023.94
org            523,698                 104.74
blog         1,360,131                 272.03
ordinary   244,228,364                   6.10

Table 3, which shows the top 5 users in each of the four

categories, suggests that the sampling method yields results

that are consistent with our objective of identifying

users who are prominent exemplars of our target categories.

Among the celebrity list, for example, “aplusk” is the handle

for actor Ashton Kutcher, one of the first celebrities to

embrace Twitter and still one of the most followed users,

while the remaining celebrity users (Lady Gaga, Ellen DeGeneres,

Oprah Winfrey, and Taylor Swift) are all household

names. In the media category, CNN Breaking News and the

New York Times are most prominent, followed by Breaking

News, Time, and Asahi, a leading Japanese daily newspa-

per. Among organizations, Google, Starbucks, and Twit-

ter are obviously large and socially prominent corporations,

while JoinRed is the charity organization started by Bono of

U2, and ollehkt is the Twitter account for KT, formerly Ko-

rean Telecom. Finally, among the blogging category, Mash-

able and ProBlogger are both prominent US blogging sites,

while Kibe Loco and Nao Salvo are popular blogs in Brazil,

and dooce is the blog of Heather Armstrong, a widely read

“mommy blogger” with over 1.5M followers.

Table 3: Top 5 users in each category

Celebrity       Media          Org         Blog
aplusk          cnnbrk         google      mashable
ladygaga        nytimes        Starbucks   problogger
TheEllenShow    asahi          twitter     kibeloco
taylorswift13   BreakingNews   joinred     naosalvo
Oprah           TIME           ollehkt     dooce

4. “WHO LISTENS TO WHOM”

The results of the previous section provide qualiﬁed sup-

port for the conventional wisdom that audiences have be-

come increasingly fragmented. Clearly, ordinary users on

Twitter are receiving their information from many thou-

sands of distinct sources, most of which are not traditional

media organizations—even though media outlets are by far

the most active users on Twitter, only about 15% of tweets

received by ordinary users are received directly from the

media. Equally interesting, however, is that in spite of this

fragmentation, it remains the case that 20K elite users, comprising

less than 0.05% of the user population, attract almost

50% of all attention within Twitter. Even if the media

has lost attention relative to other elites, information ﬂows

have not become egalitarian by any means.

The prominence of elite users also raises the question of

how these diﬀerent categories listen to each other. To ad-

dress this issue, we compute the volume of tweets exchanged

between elite categories. Speciﬁcally, Figure 3 shows the

average percentage of tweets that category i receives from

category j, exhibiting striking homophily with respect to

attention: celebrities overwhelmingly pay attention to other

celebrities, media actors pay attention to other media actors,

and so on. The one slight exception to this rule is that

organizations pay more attention to bloggers than to themselves.

In general, in fact, attention paid by organizations is

more evenly distributed across categories than for any other

category.

Figure 3: Share of tweets received among elite categories (nodes: Celeb, Media, Org, Blog; an edge from A to B indicates that B receives tweets from A)

Figure 4: RT behavior among elite categories

Figure 3, it should be noted, shows only how many URLs

are received by category i from category j, a particularly weak

measure of attention for the simple reason that many tweets

go unread. A stronger measure of attention, therefore, is

to consider instead only those URLs introduced by category

i that are subsequently retweeted by category j. Figure 4

shows how much information originating from each category

is retweeted by other categories. As with our previous mea-

sure of attention, retweeting is strongly homophilous among

elite categories; however, bloggers are disproportionately re-

sponsible for retweeting URLs originated by all categories.

This result reﬂects the characterization of bloggers as recy-

clers and ﬁlters of information. However, even though on a

per-capita basis bloggers disproportionately occupy the role

of information recyclers—93 retweets per person, compared

to only 1.1 retweets per person for ordinary users—the total

number of URLs retweeted by bloggers (465k) is vastly out-

weighed by the number retweeted by ordinary users (46M);

thus their overall impact is relatively minimal.
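The category-level retweet counts and the per-capita comparison can be sketched as follows. The input format (pairs of retweeter and original author) and the function names are assumptions for illustration.

```python
from collections import defaultdict


def retweet_matrix(retweets, user_category):
    """Count retweets by (origin category, retweeter category).

    retweets: iterable of (retweeter, original_author) pairs
    user_category: dict mapping elite users to their category;
                   all other users are treated as "ordinary"
    """
    counts = defaultdict(lambda: defaultdict(int))
    for retweeter, author in retweets:
        origin = user_category.get(author, "ordinary")
        dest = user_category.get(retweeter, "ordinary")
        counts[origin][dest] += 1
    return {origin: dict(row) for origin, row in counts.items()}


def retweets_per_capita(total_retweets, population):
    """Average number of retweets per member of a category."""
    return total_retweets / population
```

With the figures quoted above, `retweets_per_capita(465_000, 5_000)` gives 93 retweets per blogger, whereas 46M retweets spread over tens of millions of ordinary users amounts to only about one each.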

4.1 Two-Step Flow of Information

Examining information ﬂow on Twitter can also shed new

light on the theory of the two-step ﬂow, arguably the theory

that has most successfully captured the dueling importance

of mass media and interpersonal inﬂuence. The essence of

the two-step ﬂow is that information passes from the media

to the masses not directly, as supposed by early theories of

mass communication, but passes ﬁrst through an intermedi-

ate layer of “opinion leaders” who decide which information

to rebroadcast to their followers, and which to ignore. As

we have already noted, on Twitter the ﬂow of information

to the masses from the media accounts for only a fraction

of the total volume of information. Nevertheless, it is still a

substantial fraction, so it is still interesting to ask: for the

special case of information originating from media sources,

what proportion is broadcast directly to the masses, and

what proportion is transmitted indirectly via some popula-

tion of intermediaries? In addition, we may inquire whether

these intermediaries, to the extent they exist, are drawn

from other elite categories or from ordinary users, as claimed

by the two-step ﬂow theory; and if the latter, in what re-

spects they diﬀer from other ordinary users.

Before proceeding with this analysis, we note that there

are two ways information can pass through an intermedi-

ary in Twitter. The ﬁrst is via retweeting, which occurs

when a user explicitly rebroadcasts a URL that he or she
has received, along with an explicit acknowledgement of the
source, either using the official retweet function provided by
Twitter or making use of an informal convention such as

“RT @user” or “via @user.” The second mechanism is what

we label reintroduction, where a user subsequently tweets a

URL that has previously been introduced by another user,

but without the acknowledgment, in which case we assume

the information has been rediscovered independently. For

the purposes of studying when a user receives information

directly from the media or indirectly through an intermedi-

ary, we treat retweets and reintroductions equivalently. If

the ﬁrst occurrence of a URL in Twitter came from a media

user, but a user received the URL from another source, then

that source can be considered an intermediary, whether they

are citing the source within Twitter by retweeting the URL,

or reintroducing it, having discovered the URL outside of

Twitter.
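The distinction between retweets and reintroductions described above can be sketched as a simple text rule. This is a minimal illustration, not the procedure actually used in the study: the regular expression, the function name `classify_repeat`, and the `is_official_retweet` flag are all assumptions, and real tweets follow the "RT @user" and "via @user" conventions far less uniformly.

```python
import re

# Informal attribution conventions named in the text: "RT @user", "via @user".
# The exact pattern is an assumption made for this sketch.
ATTRIBUTION = re.compile(r"\b(?:RT|via)\s*:?\s*@(\w+)", re.IGNORECASE)


def classify_repeat(text, is_official_retweet=False):
    """Label a later tweet of a URL that has already appeared.

    Returns ("retweet", source) when the tweet acknowledges a source,
    either through the official retweet function (flagged by the caller)
    or through an informal convention found in the text. Otherwise
    returns ("reintroduction", None).
    """
    match = ATTRIBUTION.search(text)
    source = match.group(1) if match else None
    if is_official_retweet or match:
        return "retweet", source
    return "reintroduction", None


examples = [
    "RT @nytimes: story http://bit.ly/x",
    "Great read http://bit.ly/x via @cnn",
    "Interesting http://bit.ly/x",
]
for tweet in examples:
    print(classify_repeat(tweet), "<-", tweet)
```

For the two-step analysis both labels are treated the same way, so a classifier of this kind matters mainly for the retweet-rate results later in the paper.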

To quantify the extent to which ordinary users get their

information indirectly versus directly from the media, we

sampled 1M random ordinary users, and for each user
counted the number $n$ of bit.ly URLs they had received that
had originated from one of our 5K media users; of
the 1M total, 600K had received at least one such URL.
For each member of this 600K subset we then counted the
number $n_2$ of these URLs that they received via non-media
friends; that is, via a two-step flow. The average fraction
$n_2/n = 0.46$ therefore represents the proportion of
media-originated content that reaches the masses via an
intermediary rather than directly. As Figure 5 shows, however,

this average is somewhat misleading. In reality, the pop-

ulation comprises two types—those who receive essentially

all of their media-originating information via two-step ﬂows

and those who receive virtually all of it directly from the

media. Unsurprisingly, the former type is exposed to less

total media than the latter. What is surprising, however, is

that even users who received up to 100 media URLs during
our observation period received all of them from opinion
leaders.

Footnote: As before, performing this analysis for the entire population
of over 40M ordinary users proved to be computationally
unfeasible.

Figure 5: Percentage of information that is received
via an intermediary as a function of total volume of
media content to which a user is exposed.
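The per-user two-step fraction discussed above can be sketched as follows. The data layout (a list of `(url, sender)` pairs per user), the account names, and the rule that a URL counts as two-step whenever at least one non-media friend delivered it are assumptions made for illustration; the text does not specify how URLs received both ways were handled.

```python
from statistics import mean


def two_step_fraction(received, media_users):
    """Fraction of media-originated URLs a user received via non-media friends.

    received: iterable of (url, sender) pairs, restricted to URLs that
              originated from a media account (hypothetical layout).
    media_users: set of account names classified as media.
    Returns None when the user received no such URL.
    """
    all_urls = set()
    via_friends = set()
    for url, sender in received:
        all_urls.add(url)
        if sender not in media_users:
            via_friends.add(url)
    if not all_urls:
        return None
    return len(via_friends) / len(all_urls)


def average_two_step_fraction(timelines, media_users):
    """Mean of the per-user fractions over users with at least one URL."""
    fractions = [
        f
        for f in (two_step_fraction(r, media_users) for r in timelines.values())
        if f is not None
    ]
    return mean(fractions) if fractions else None


# Toy data: all names are hypothetical.
MEDIA = {"nytimes", "cnnbrk"}
TIMELINES = {
    "alice": [("u1", "nytimes"), ("u2", "bob")],  # half direct, half two-step
    "carol": [("u3", "dave"), ("u4", "bob")],     # everything via friends
    "erin": [],                                   # no media URLs, excluded
}
print(average_two_step_fraction(TIMELINES, MEDIA))
```

In the toy data one user sits at 0.5 and another at 1.0, which shows how an average can hide a population split between mostly-direct and mostly-indirect recipients.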

Who are these intermediaries, and how many of them are

there? In total, the population of intermediaries is smaller

than that of the users who rely on them, but still surprisingly

large, roughly 500K, the vast majority of whom (96%) are
classified as ordinary users, not elites. Interestingly, Figure 5c

also shows that at least some intermediaries also receive the

bulk of their media content indirectly, just like other ordi-

nary users. Comparing Figure 5a and 5c, however, we note

that intermediaries are not like other ordinary users in that

they are exposed to considerably more media than randomly

selected users, hence the number of intermediaries who rely

on two-step ﬂows is much smaller than for random users. In

addition, we ﬁnd that on average intermediaries have more

followers than randomly sampled users (543 followers versus

34) and are also more active (180 tweets on average, versus

7). Finally, Figure 6 shows that although all intermediaries,

by deﬁnition, pass along media content to at least one other

user, a minority satisﬁes this function for multiple users,

where we note that the most prominent intermediaries are

disproportionately drawn from the 4% who are elite users; Ashton
Kutcher (aplusk), for example, acts as an intermediary for
over 100,000 users.

Interestingly, these results are all broadly consistent with

the original conception of the two-step ﬂow, advanced over

50 years ago, which emphasized that opinion leaders were

“distributed in all occupational groups, and on every social

and economic level,” corresponding to our classiﬁcation of

most intermediaries as ordinary [6]. The original theory

also emphasized that opinion leaders, like their followers,

also received at least some of their information via two-step

ﬂows, but that in general they were more exposed to the

media than their followers—just as we ﬁnd here. Finally,

the theory predicted that opinion leadership was not a bi-

nary attribute, but rather a continuously varying one, cor-

responding to our ﬁnding that intermediaries vary widely in

the number of users for whom they act as ﬁlters and trans-

mitters of media content. Given the length of time that has

elapsed since the theory of the two-step ﬂow was articulated,

and the transformational changes that have taken place in

communications technology in the interim (given, in fact,
that a service like Twitter was likely unimaginable at the
time), it is remarkable how well the theory agrees with our
observations.

Figure 6: Frequency of intermediaries binned by number of
randomly sampled users to whom they transmit media content.
(Axis labels: # of two-step recipients; # of opinion leaders.)

5. WHO LISTENS TO WHAT?

The results in Section 4 demonstrate that “elite” users account
for a substantial portion of all of the attention on
Twitter, but also show clear differences in how the attention
is allocated to the different elite categories. It is therefore
interesting to consider what kinds of content are being shared

by these categories. Given the large number of URLs in our

observation period (260M), and the many diﬀerent ways one

can classify content (video vs. text, news vs. entertainment,

political news vs. sports news, etc.), classifying even a small

fraction of URLs according to content is an onerous task.

Bakshy et al. [1], for example, used Amazon’s Mechanical

Turk to classify a stratiﬁed sample of 1,000 URLs along a

variety of dimensions; however, this method does not scale

well to larger sample sizes.

Instead, we restrict attention to URLs originated by the

New York Times which, with over 2.5M followers, is the

second-most-followed news organization on Twitter, after

CNN Breaking News. The NY Times, however, is roughly ten
times as active as CNN Breaking News, so it is arguably a
better source of data. To classify NY Times content, we

exploit a convenient feature of their format—namely that

all NY Times URLs are classiﬁed in a consistent way by

the section in which they appear (e.g. U.S., World, Sports,

Science, Arts, etc.). Of the 6398 New York Times bit.ly

URLs we observed, 6370 could be successfully unshortened

and assigned to one of 21 categories. Of these, however, only

9 categories had more than 100 URLs during the observa-

tion period, one of which—“NY region”—was highly speciﬁc

to the New York metropolitan area; thus we focused our

attention on the remaining 8 topical categories. Figure 7

shows the proportion of URLs from each New York Times

section retweeted or reintroduced by each category. World
news is the most popular category, followed by U.S. News,
Business, and Sports, where increasingly niche categories
like Health, Arts, Science, and Technology are less popular
still. In general, the overall pattern is replicated for all
categories of users, but there are some minor deviations: in
particular, organizations show disproportionately little
interest in business and arts-related stories, and
disproportionately high interest in science, technology, and possibly
world news. Celebrities, by contrast, show greater interest
in sports and less interest in health, while the media shows
somewhat greater interest in U.S. news stories.

Footnote: http://www.nytimes.com/year/month/day/category/title.html?ref=category

Figure 7: Number of RTs and Reintroductions of
New York Times stories by content category.
(Panels: 1. World News, 2. U.S. News, 3. Business, 4. Sports,
5. Health, 6. Technology, 7. Science, 8. Arts. x-axis: User
Category (blog, celeb, media, org, other); y-axis: % RTs and
Re-introductions, 0.00 to 0.35.)
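The section-based classification of NY Times URLs described above could be sketched along these lines, assuming unshortened URLs follow the year/month/day/category/title pattern given in the footnote. The function names are hypothetical, and real NY Times URLs may nest sub-sections under the category segment, which this sketch ignores.

```python
from collections import Counter
from urllib.parse import urlparse


def nyt_section(url):
    """Return the category segment of an unshortened NY Times URL, or None.

    Assumes the path layout year/month/day/category/title.html.
    """
    parsed = urlparse(url)
    if not parsed.netloc.endswith("nytimes.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Need three numeric date segments, a category, and a title.
    if len(parts) < 5 or not all(p.isdigit() for p in parts[:3]):
        return None
    return parts[3]


def frequent_sections(urls, min_count=100):
    """Count URLs per section and keep sections with more than min_count."""
    counts = Counter(s for s in map(nyt_section, urls) if s is not None)
    return {section: c for section, c in counts.items() if c > min_count}


sample = [
    "http://www.nytimes.com/2010/01/15/world/title.html?ref=world",
    "http://www.nytimes.com/2010/01/16/world/other.html?ref=world",
    "http://www.nytimes.com/2010/01/16/sports/game.html?ref=sports",
    "http://www.nytimes.com/pages/world/index.html",  # not an article path
]
print(frequent_sections(sample, min_count=1))
```

The `min_count=100` default mirrors the threshold in the text, where only sections with more than 100 URLs in the observation period were retained.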

5.1 Lifespan of Content

In addition to diﬀerent types of content, URLs introduced

by diﬀerent types of elite users or ordinary users may exhibit

diﬀerent lifespans, by which we mean the time lag between

the ﬁrst and last appearance of a given URL on Twitter.

Naively, measuring lifespan seems a trivial matter; how-

ever, a ﬁnite observation period—which results in censoring

of our data—complicates this task. In other words, a URL

that is last observed towards the end of the observation pe-

riod may be retweeted or reintroduced after the period ends,

while correspondingly, a URL that is ﬁrst observed toward

the beginning of the observation window may in fact have

been introduced before the window began. What we ob-

serve as the lifespan of a URL, therefore, is in reality a

lower bound on the lifespan. Although this limitation does

not create much of a problem for short-lived URLs—which

account for the vast majority of our observations—it does

create large biases for long-lived URLs. In particular, URLs
that appear towards the end of our observation period will
be systematically classified as shorter-lived than URLs that
appear towards the beginning.

Figure 8: Schematic of lifespan estimation procedure.
(Recovered labels: first observation of URL; last observation
of URL; estimation period = 133 days; evaluation period = 90
days; total observation window = 223 days.)

To address the censoring problem, we seek to determine

a buﬀer δ at both the beginning and the end of our 223-

day period, and only count URLs as having a lifespan of τ

if (a) they do not appear in the ﬁrst δ days, (b) they ﬁrst

appear in the interval between the buﬀers, and (c) they do

not appear in the last δ days, as illustrated in Figure 8(a).

To determine δ we ﬁrst split the 223 day period into two

segments—the ﬁrst 133 day estimation period and the last

90 day evaluation period (see Figure 8(b))—and then ask: if

we (a) observe a URL ﬁrst appear in the ﬁrst 133−δ days and

(b) do not see it in the δ days prior to the splitting point, how

likely are we to see it in the last 90 days? Clearly this depends

on the actual lifespan of the URL, as the longer a URL

lives, the more likely it will re-appear in the future. Using

this estimation/evaluation split, we ﬁnd an upper-bound on

lifespan for which we can determine the actual lifespan with

95% accuracy as a function of δ. Finally, because we require

a beginning and ending buﬀer, and because we can only

classify a URL as having lifespan τ if it appears at least τ

days before the end of our window, we need to pick τ and

δ such that τ + 2δ ≤ 223. We determined that τ = 70

and δ = 70 suﬃciently satisﬁed our constraints; thus for

the following analysis, we consider only URLs that have a

lifespan τ ≤ 70.
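The buffer rule above can be sketched as a filter on each URL's first and last observed dates. The constants come from the text (a 223-day window, δ = 70, τ ≤ 70); the function name, the day-boundary conventions, and the example start date are assumptions made for illustration only.

```python
from datetime import date, timedelta

WINDOW_DAYS = 223    # total observation window, from the text
DELTA_DAYS = 70      # buffer at each end of the window, from the text
TAU_MAX_DAYS = 70    # longest lifespan retained, from the text

# Constraint stated in the text: tau + 2 * delta must fit in the window.
assert TAU_MAX_DAYS + 2 * DELTA_DAYS <= WINDOW_DAYS


def usable_lifespan(first_seen, last_seen, window_start):
    """Return a URL's lifespan in days, or None if it fails the buffer rule.

    A URL is kept only if it never appears in the first DELTA_DAYS of the
    window, never appears in the last DELTA_DAYS, and its observed
    lifespan does not exceed TAU_MAX_DAYS.
    """
    lower = window_start + timedelta(days=DELTA_DAYS)
    upper = window_start + timedelta(days=WINDOW_DAYS - DELTA_DAYS)
    if first_seen < lower:    # seen inside the opening buffer
        return None
    if last_seen >= upper:    # seen inside the closing buffer
        return None
    tau = (last_seen - first_seen).days
    return tau if tau <= TAU_MAX_DAYS else None


# Hypothetical window start, chosen only to make the example concrete.
start = date(2010, 1, 1)
first = start + timedelta(days=80)
last = start + timedelta(days=90)
print(usable_lifespan(first, last, start))
```

A URL seen only once between the buffers gets a lifespan of 0, matching the convention used in the category analysis that follows.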

5.2 Lifespan By Category

Having established a method for estimating URL lifes-

pan, we now explore the lifespan of URLs introduced by

diﬀerent categories of users, as shown in Figure 9(a). URLs

initiated by the elite categories exhibit a similar distribu-

tion over lifespan to those initiated by ordinary users. As

Figure 9(b) shows, however, when looking at the percent-

age of URLs of diﬀerent lifespans initiated by each category,

we see two additional results: first, media actors originate
a large portion of short-lived URLs (especially
URLs with lifespan = 0, those that only appeared
once); and second, URLs originated by bloggers are
overrepresented among the longer-lived content. Both of these

results can be explained by the type of content that origi-

nates from diﬀerent sources: whereas news stories tend to

be replaced by updates on a daily or more frequent basis,

the sorts of URLs that are picked up by bloggers are of more

persistent interest, and so are more likely to be retweeted or

reintroduced months or even years after their initial intro-

duction.

Figure 9: (a) Count and (b) percentage of URLs
initiated by 5 categories, with different lifespans.
(Panel (a): x-axis lifespan (day), y-axis log(# of URLs with
lifespan = x day), series other, celeb, media, org, blog.
Panel (b): x-axis lifespan (day), y-axis % of URLs from elites
category, series celeb, media, org, blog.)

To shed more light on the nature of long-lived content on

Twitter, we used the bit.ly API service to unshorten 35K

of the most long-lived URLs (URLs that lived at least 200

days), and mapped them into 21034 web domains. As Figure

10 shows, the population of long-lived URLs is dominated by

videos, music, and books. Twitter, in other words, should

be viewed as a subset of a much larger media ecosystem in

which content exists and is repeatedly rediscovered by Twit-

ter users. Some of this content—such as daily news stories—

has a relatively short period of relevance, after which a given

story is unlikely to be reintroduced or rebroadcast. At the

other extreme, classic music videos, movie clips, and long-

format magazine articles have lifespans that are eﬀectively

unbounded, and can seemingly be rediscovered by Twitter

users indeﬁnitely without losing relevance.

Two related points are illustrated by Figure 11, which

shows the average RT rate (the proportion of tweets con-

taining the URL that are retweets of another tweet) of URLs

with different lifespans, grouped by the categories that
introduced the URL. First, for ordinary users, the majority
of appearances of URLs after the initial introduction derives
not from retweeting, but rather from reintroduction, where
this result is especially pronounced for long-lived URLs.
For the vast majority of URLs on Twitter, in other words,
longevity is determined not by diffusion, but by many
different users independently rediscovering the same content,
consistent with our interpretation above. Second, however,
for URLs introduced by elite users, the result is somewhat
the opposite; that is, they are more likely to be retweeted
than reintroduced, even for URLs that persist for weeks.
Although it is unsurprising that elite users generate more
retweets than ordinary users, the size of the difference is
nevertheless striking, and suggests that in spite of the
dominant result above that content lifespan is determined to a
large extent by type, the source of its origin also impacts its
persistence, at least on average, a result that is consistent
with previous findings [1].

Footnote: Note here that URLs with lifespan = 0 are those URLs
that only appeared once in our dataset, thus the RT rate is
zero.

Figure 10: Top 20 domains for URLs that lived more
than 200 days

Figure 11: Average RT rate by lifespan for each of
the originating categories. (x-axis: lifespan (day); y-axis:
RT rate (# of RTs / total # of occurrences), 0.0 to 1.0;
series: other, celeb, media, org, blog.)

6. CONCLUSIONS

In this paper, we investigated a classic problem in media
communications research, captured by the first part of
Lasswell’s maxim, “who says what to whom”, in the context
of Twitter. By restricting our attention to Twitter, our
conclusions are necessarily limited to one narrow cross-section of

the media landscape. Moreover, communications on Twitter

may be unrepresentative of information ﬂow via more tradi-

tional channels, such as TV and radio on the one hand, and

interpersonal interactions on the other hand. However, we

feel the advantages of using Twitter to answer this question

outweighed the limitations. First, because Twitter users ex-

plicitly opt-in to “follow” each other, and because Twitter

maintains a complete record of every tweet broadcast, it

provides an unprecedented level of resolution and coverage

regarding who is listening to whom. Second, because Twit-

ter users themselves classify other users by including them

on lists, Twitter eﬀectively provides a ready-made, crowd-

sourced classiﬁcation scheme of users.

By studying the ﬂow of information among the ﬁve cat-

egories that we identiﬁed (media, celebrities, organizations,

bloggers, and ordinary), our analysis sheds new light on

some old questions of communications research. First, we

ﬁnd that although audience attention has indeed fragmented

among a wider pool of content producers than classical models
of mass media assumed, attention remains highly concentrated:
roughly 0.05% of the population accounts for almost

half of all attention. Within the population of elite users,

moreover, attention is highly homophilous, with celebrities

following celebrities, media following media, and bloggers

following bloggers. Second, we ﬁnd considerable support for

the two-step flow of information: almost half the information
that originates from the media passes to the masses
indirectly via a diffuse intermediate layer of opinion leaders, who,
although classified as ordinary users, are more connected
and more exposed to the media than their followers. Third,

we ﬁnd that although all categories devote a roughly simi-

lar fraction of their attention to diﬀerent categories of news

(World, U.S., Business, etc.), there are some differences:

organizations, for example, devote a surprisingly small frac-

tion of their attention to business-related news. We also ﬁnd

that diﬀerent types of content exhibit very diﬀerent lifes-

pans. In particular, media-originated URLs are dispropor-

tionately represented among short-lived URLs while those

originated by bloggers tend to be overrepresented among

long-lived URLs. Finally, we ﬁnd that the longest-lived

URLs are dominated by content such as videos and music,

which are continually being rediscovered by Twitter users

and appear to persist indeﬁnitely.

In closing, we note that although our use of Twitter lists

to label users was motivated by a speciﬁc set of questions

regarding mass vs. interpersonal communications, and that

for this reason we have focused on a limited set of predeter-

mined user-categories, it would also be interesting to explore

automatic classiﬁcation schemes from which additional user

categories could emerge. In particular, such an approach

would allow one to examine the category of opinion lead-

ers in more detail, possibly identifying opinion leaders for

diﬀerent topics, as has been proposed elsewhere [14]. In

addition, another area for future work would be to extract

content information in a more systematic manner, shedding

more light on the “what” element of Lasswell’s maxim. And

ﬁnally, a signiﬁcant challenge for future work is to merge

the data regarding information ﬂow on Twitter with other

sources of outcome data (relating, for example, to opinions
or actions) that would engage more directly with the “effects”
component of Lasswell’s maxim.

7. REFERENCES

[1] E. Bakshy, J. M. Hofman, W. A. Mason, and
D. J. Watts. Identifying ‘influencers’ on twitter. In
Fourth ACM International Conference on Web Search
and Data Mining (WSDM), Hong Kong, 2011. ACM.

[2] W. L. Bennett and S. Iyengar. A new era of minimal

eﬀects? the changing foundations of political

communication. Journal of Communication,

58(4):707–731, 2008.

[3] M. Cha, H. Haddadi, F. Benevenuto, and K. P.
Gummadi. Measuring user influence on twitter: The
million follower fallacy. In 4th Int’l AAAI Conference
on Weblogs and Social Media, Washington, DC, 2010.

[4] J. S. Coleman, E. Katz, and H. Menzel. The diﬀusion

of an innovation among physicians. Sociometry,

20(4):253–270, 1957.

[5] T. Gitlin. Media sociology: The dominant paradigm.

Theory and Society, 6(2):205–253, 1978.

[6] E. Katz and P. F. Lazarsfeld. Personal influence; the
part played by people in the flow of mass
communications. Free Press, Glencoe, Ill., 1955.

[7] G. Kossinets and D. J. Watts. Empirical analysis of an

evolving social network. Science, 311(5757):88–90,

2006.

[8] H. Kwak, C. Lee, H. Park, and S. Moon. What is

twitter, a social network or a news media? In

Proceedings of the 19th international conference on

World Wide Web, pages 591–600. ACM, 2010.

[9] H. D. Lasswell. The structure and function of

communication in society. In L. Bryson, editor, The

Communication of Ideas, pages 117–130. University of

Illinois Press, Urbana, IL, 1948.

[10] P. F. Lazarsfeld, B. Berelson, and H. Gaudet. The

people’s choice; how the voter makes up his mind in a

presidential campaign. Columbia University Press,

New York, 3rd edition, 1968.

[11] R. K. Merton. Patterns of inﬂuence: Local and

cosmopolitan inﬂuentials. In R. K. Merton, editor,

Social theory and social structure, pages 441–474. Free

Press, New York, 1968.

[12] J. Surowiecki. The Wisdom of Crowds: Why the many
are smarter than the few and how collective wisdom
shapes business, economies, societies, and nations.
Doubleday, New York, 1st edition, 2004.

[13] J. B. Walther, C. T. Carr, S. S. W. Choi, D. C.

DeAndrea, J. Kim, S. T. Tong, and B. Van Der Heide.

Interaction of interpersonal, peer, and media inﬂuence

sources online. In Z. Papacharissi, editor, A Networked

Self: Identity, Community, and Culture on Social

Network Sites, pages 17–38. Routledge, 2010.

[14] G. Weimann. The Inﬂuentials: People Who Inﬂuence

People. State University of New York Press, Albany,

NY, 1994.

[15] J. Weng, E. P. Lim, J. Jiang, and Q. He. Twitterrank:

ﬁnding topic-sensitive inﬂuential twitterers. In

Proceedings of the third ACM international conference

on Web search and data mining, pages 261–270. ACM,

2010.