Daejin Choi, Jinyoung Han, Taejoong Chung, Yong-Yeol Ahn, Byung-Gon Chun, Ted "Taekyoung" Kwon
ACM Conference on Online Social Networks (COSN) 2015, Stanford, CA, USA, November 2015.

 It has become the norm for people to communicate with one another through various online social channels, where different conversation structures form depending on the platform. One common online communication pattern is the threaded conversation, in which a user brings up a topic and other people respond to the initiator or to other participants by commenting; such a conversation can be modeled as a tree. This paper investigates (i) the characteristics of online threaded conversations in terms of volume, responsiveness, and virality and (ii) which content properties and user participation behaviors are associated with those characteristics, and how. To this end, we collect 700K threaded conversations from 1.5M users on Reddit, one of the most popular online communities that let people communicate in the form of threaded conversations. Using the collected dataset, we find that "social" words, text difficulty, and document relevancy are associated with the volume, responsiveness, and virality of conversations. By analyzing user interactions, we also discover that large and viral conversations are mostly formed by a small portion of users who reciprocally communicate with others. Our analysis of user roles in conversations reveals that users who are interested in multiple topics play important roles in large and viral conversations, whereas heavily-posting users play important roles in responsive conversations. We extend our analysis to topical communities (i.e., subreddits) and find that news-related, image-based, and discussion-related communities are more likely to have large, responsive, and viral conversations, respectively.

[PDF Link]
 author = {Choi, Daejin and Han, Jinyoung and Chung, Taejoong and Ahn, Yong-Yeol and Chun, Byung-Gon and Kwon, Ted Taekyoung},
 title = {Characterizing Conversation Patterns in Reddit: From the Perspectives of Content Properties and User Participation Behaviors},
 booktitle = {Proceedings of the 2015 ACM on Conference on Online Social Networks},
 series = {COSN '15},
 year = {2015},
 isbn = {978-1-4503-3951-3},
 location = {Palo Alto, California, USA},
 pages = {233--243},
 numpages = {11},
 url = {http://doi.acm.org/10.1145/2817946.2817959},
 doi = {10.1145/2817946.2817959},
 acmid = {2817959},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {comment, online communication, reddit, subreddits, threaded conversation, user behavior, virality},

Measurement Framework

 To retrieve posts and their associated comments, we developed a measurement system for data collection and analysis, as shown in the figure below. The measurement system consists of three parts: (i) the Reddit interface module, (ii) the core module, and (iii) the DB module. The Reddit interface module communicates with Reddit.com through the APIs provided by Reddit. We use the "Python Reddit API Wrapper (PRAW)", a popular Reddit API wrapper package written in Python.

  To monitor posts and their follow-up comments, we developed two key submodules in the core module: the post observer and the comment observer. Once every minute, the post observer fetches all new posts in each subreddit. At the time of our data collection, the Reddit APIs returned up to 1,000 recent posts per subreddit in chronological order; hence our crawler fetches up to 1,000 posts every minute so as not to miss newly uploaded posts. Whenever the post observer identifies a new post, the comment observer begins to track all comments on that post; in the same way, it monitors and collects every comment associated with the posts we have fetched. We collected every single post and comment during our measurement period, since the observed maximum number of messages per minute was 722, which did not exceed the message limit of the Reddit API.
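The polling step of the post observer can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function name `poll_new_posts`, the record fields `id` and `created_utc`, and the injected `fetch_newest` callable are all hypothetical. With a current PRAW release, a listing call such as `subreddit.new(limit=1000)` could back that callable, but the interface available during the 2014 collection may have differed.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical post record, e.g. {"id": "abc", "created_utc": 1394668800}
Post = Dict[str, object]


def poll_new_posts(
    fetch_newest: Callable[[int], Iterable[Post]],
    seen_ids: Set[str],
    limit: int = 1000,
) -> List[Post]:
    """Run one polling round and return only posts not seen in earlier rounds.

    fetch_newest(limit) is assumed to return up to `limit` of the most
    recent posts of one subreddit (the 1,000-post cap described above).
    """
    fresh: List[Post] = []
    for post in fetch_newest(limit):
        post_id = str(post["id"])
        if post_id not in seen_ids:
            seen_ids.add(post_id)  # remember it so later rounds skip it
            fresh.append(post)
    # Oldest first, so a comment observer could register posts in upload order.
    fresh.sort(key=lambda p: p["created_utc"])
    return fresh
```

Calling this function once per minute for each subreddit loses no posts as long as fewer than `limit` posts arrive between two rounds, which matches the observation above that the peak rate of 722 messages per minute stayed below the 1,000-item cap.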


 The collected dataset is stored in the DB module. We chose data only from the top 100 subreddits by number of subscribers, which account for more than 60% of all subscribers in Reddit (out of 378,293 subreddits, as of Oct. 22, 2014). We collected the dataset for 35 days from March 13 to April 18, 2014; it contains 1,016,342 posts and 18,626,530 comments, shared by 1,531,247 users. We then extracted the 695,857 posts (68.5%) that each have at least one comment, together with their 18,093,422 comments; these posts and comments were written by 1,455,293 users. Each post contains the author id, title, subreddit id, and timestamp, while each comment contains the original post id, user id, comment text, and the parent that the comment replies to. The parent can be a comment or a post.
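Because every comment records its parent, each conversation can be rebuilt as a tree rooted at the post. The sketch below shows one way to do this, assuming hypothetical field names (`comment_id`, `parent_id`, and so on) and assuming that post ids and comment ids never collide. The two measures it computes, comment count and maximum depth, are illustrative only and are not claimed to be the paper's definitions of volume or virality.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Comment:
    comment_id: str
    post_id: str    # the original post the comment belongs to
    user_id: str
    text: str
    parent_id: str  # id of the parent, which is a comment or the post itself


def build_children(comments: Iterable[Comment]) -> Dict[str, List[str]]:
    """Map each parent id (post or comment) to the ids of its direct replies."""
    children: Dict[str, List[str]] = defaultdict(list)
    for c in comments:
        children[c.parent_id].append(c.comment_id)
    return dict(children)


def tree_size_and_depth(post_id: str, children: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (number of comments, maximum depth) of the tree rooted at post_id."""
    size, depth = 0, 0
    stack = [(post_id, 0)]  # iterative DFS avoids recursion limits on deep threads
    while stack:
        node, level = stack.pop()
        if level > 0:
            size += 1  # every non-root node is a comment
        depth = max(depth, level)
        for child in children.get(node, []):
            stack.append((child, level + 1))
    return size, depth
```

A post with no entry in the `children` map has size 0, so the "at least one comment" filter described above corresponds to keeping posts whose size is at least 1.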
* The data is available only on the condition that your work cites the paper listed above.


  1. Seoul National University
  2. Network Convergence & Security Laboratory