SoDA Lab

Publications

137 papers · 67 featured · citation counts from Google Scholar

Shintaro Sakai, Jisun An, Migyeong Kang, Haewoon Kwak

ACL Workshop on Cross-Cultural Considerations in NLP (C3NLP), 2026 · 2026 · 3 cites

Clinical research finds Western depression patients tend to report psychological symptoms while Eastern patients report somatic ones. We test whether LLMs (now used in mental health) reproduce this cultural pattern. They mostly don't in English; prompting in Chinese, Japanese, or Hindi helps somewhat, but a strong culturally-invariant symptom hierarchy overrides the cultural cues.

Migyeong Kang, Jihyun Kim, Hyolim Jeon, Sunwoo Hwang, Jihyun An, Yonghoon Kim, Haewoon Kwak, Jisun An, Jinyoung Han

Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2026 · 2026 · 1 cite

Mental-health symptom datasets are slow and expensive to build, limiting how well models generalize to real, diverse user posts. SynSym is an LLM-based framework that generates synthetic symptom training data by expanding each symptom into sub-concepts, varying linguistic style, and modeling co-occurrence. Models trained on SynSym data match or exceed those trained on real data.

Fan Huang, Songheng Zhang, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2604.01366, 2026 · 2026

LLMs show cognitive biases in behavior, but where are those biases encoded internally, and can they be removed? We introduce CogBias, a benchmark covering four bias families (Judgment, Information Processing, Social, Response), and show via linear probes that biases occupy separable directions in activation space. Activation steering reduces bias 26 to 32% with minimal capability loss.

Sangpil Youm, Sou Hyun Jang, Haewoon Kwak, Jaeyoung Choi, Yong Jeong Yi

JMIR Formative Research · 2026

How does public discourse around COVID-19 vaccines differ between the United States and South Korea? Analyzing nearly 8,000 question-answer pairs from Quora and Naver Knowledge-iN, we apply NLP to surface topical and sentiment differences. Both countries share concerns about side effects and efficacy, but diverge sharply on which questions get most attention.

Chaewon Kang, Youjin Lee, Haewoon Kwak, Jisun An, Jinyoung Han

CHI Poster, 2026 · 2026

Deepfake harm depends on the moment a viewer perceives the synthetic face as a specific victim, but where that threshold sits is undefined. With 102 participants and 162 manipulated stimuli, we find eyes are the dominant identity cue, that pairing eyes with one more feature triggers identity perception, and that human-AI misalignment peaks exactly at that threshold.

Byunghwee Lee, Sangyeon Kim, Filippo Menczer, Yong-Yeol Ahn, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2603.11253, 2026 · 2026

Innocuous online preferences (the bands you follow, the slang you use) can reveal sensitive traits. We show that LLMs reliably infer hidden political alignment from DebateOrg and Reddit conversations, outperforming traditional ML. Accuracy improves further when text-level inferences are aggregated to the user level and when we use politics-adjacent domains, underscoring a serious privacy risk.

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

ACL · 2026

Different online communities have different rules: what gets you banned from one subreddit may be the norm in another. PluRule is a multimodal, multilingual benchmark with 13,371 rule violations across 1,989 Reddit communities, 2,885 rules, and 9 languages. Even GPT-5.2 performs only slightly better than a trivial baseline, exposing pluralistic moderation as a fundamental challenge.

Fan Huang, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2603.16017, 2026 · 2026

How consistently do LLMs apply ethical frameworks across reasoning steps? We introduce moral reasoning trajectories (sequences of framework invocations) and analyze them across six models. Over half of consecutive steps switch frameworks; only ~17% of trajectories stay consistent. Unstable trajectories are more vulnerable to persuasive attacks, and linear probes localize framework encoding to specific layers.

Sumin Han, Jisun An, Dongman Lee

Open MIND · 2026

Urban 'vibrancy' (the dynamic human activity in a place) is well-measured by floating-population data but rarely fed into traffic models. We use VAEs to compress floating-population data into Urban Vibrancy embeddings, then forecast future embeddings with LSTMs and feed them into traffic models (RNN, DCRNN, GTS, GMAN), improving accuracy and responsiveness across the board.

Fan Huang, Haewoon Kwak, Jisun An

ACL Findings · 2026

Are LLMs persuaded out of their beliefs? Using the Source–Message–Channel–Receiver (SMCR) communication framework across six mainstream LLMs and three domains (facts, medical QA, social bias), we measure how stable each model's stated beliefs are under persuasive pressure. The smallest model flips on 82.5% of attempts at the first turn. Counterintuitively, asking models to verbalize confidence makes them more vulnerable.

Hyunseon Won, Haewoon Kwak, Jisun An, Jinyoung Han

CHI Poster, 2026 · 2026

When AI joins a creative collaboration, does behaving cooperatively erase identity-based bias against it? In a turn-based collaborative drawing study (N=30), participants worked with a human confederate believed to be either human or AI. Cooperative behavior improved overall perceptions, but creativity ratings consistently favored human partners despite identical behavior. Identity bias persists.

Weihong Qi, Fan Huang, Rasika Muralidharan, Jisun An, Haewoon Kwak

arXiv preprint arXiv:2601.11286, 2026 · 2026

AI-human alignment is usually measured by outcome agreement (accuracy, F1). XChoice fits a mechanism-based decision model to both human and LLM choices, recovering interpretable parameters: which factors matter, how sensitive each is to constraints, what trade-offs get made. Applied to U.S. time-use data, XChoice reveals systematic misalignment concentrated in specific demographic subgroups.

Rachith Aiyappa, Shruthi Senthilmani, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

PeerJ Computer Science · 2026

Can LLMs do stance detection on tweets without any fine-tuning? Using FlanT5-XXL with SemEval 2016 and P-Stance datasets, we find zero-shot prompting matches or beats fine-tuned baselines. Performance varies with prompt phrasing, decoding strategy, and a measurable positivity bias, and we surface cases where the LLM over-commits to a stance when none is actually present.

Byunghwee Lee, Rachith Aiyappa, Yong-Yeol Ahn, Haewoon Kwak, Jisun An

Nature Human Behaviour, 2025 · 2025 · 16 cites

How do beliefs interconnect, and what drives a person to adopt new ones? We fine-tune large language models on online debate data to map thousands of beliefs into a semantic space where proximity reflects coherence. Position in that space predicts which beliefs an individual is likely to adopt next and quantifies cognitive dissonance via distance between existing and new beliefs.

Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, Jisun An, Ponnurangam Kumaraguru

arXiv preprint arXiv:2512.24848, 2025 · 2025 · 2 cites

Personalized AI agents must access private data (emails, chats, purchases) to be useful, but they can leak it. PrivacyBench is a benchmark with conversational scenarios containing embedded secrets. RAG assistants leak secrets in up to 26.6% of conversations; a privacy-aware prompt cuts this to 5.1%, but retrieval still pulls sensitive data indiscriminately.

Weihong Qi, Fan Huang, Jisun An, Haewoon Kwak

2025 IEEE International Conference on Data Mining Workshops (ICDMW) · 2025

Can LLMs realistically simulate public opinion across different cultures? Comparing DeepSeek-R1/V3 with Qwen2.5, GPT-4o, and Llama-3.3 against survey data from the U.S. (ANES) and China (Zuobiao), we find performance varies sharply by topic and demographic, and that all models over-generalize within demographic groups, defaulting to a 'representative' voice.

Youngjun Park, Jisun An, Dongman Lee

International Conference on Tourism Research · 2025

How do Airbnb listings shift month-to-month across a city's neighborhoods? Using public datasets covering Seoul from 2017 to 2019, we model monthly variation as a function of housing supply, lodging density, tourist-attraction proximity, and retail activity, achieving strong fit (avg pseudo R² of 0.79) and identifying which factors drive short-term rental dynamics in each district.

Rasika Muralidharan, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2510.07488, 2025 · 2025

What does team science say about multi-agent LLM systems? We test three core team-science variables (structure, diversity, interaction dynamics) across four reasoning tasks. Flat teams beat hierarchical ones, diversity has nuanced effects, and agents are overconfident before tasks but more reflective afterward, voicing both appreciation for collaboration and difficulty with integration.

Devin R. Wright, Jisun An, Yong-Yeol Ahn

arXiv preprint arXiv:2509.16813, 2025 · 2025

Identity fusion (the psychological merging of self with a group, ideology, or brand) has historically been measured with surveys. CLIFS uses cognitive linguistics and LLMs to detect implicit metaphors in text and score fusion automatically, matching survey instruments while scaling. As a proof of concept, it improves violence risk assessment by more than 240%.

Hongju Lee, Youngjun Park, Jisun An, Dongman Lee

arXiv preprint arXiv:2511.14248, 2025 · 2025

Forecasting regional Airbnb activity matters for housing policy, but typical models look at single listings at single time points. We forecast Revenue, Reservation Days, and Number of Reservations at the regional level by combining listing features with LLM-generated embeddings of urban accessibility and human mobility, cutting RMSE and MAE by roughly 48% versus standard baselines.

Dasom Hong, Sumin Han, Youngjun Park, Jisun An, Dongman Lee

Area · 2025

Do social-media check-ins reshape urban land value? Modeling three Seoul neighborhoods (Yeonnam-dong, Seongsu-dong, Gyeongridan-gil) that gentrified after going viral on Instagram, we show that adding online-popularity features to a spatial capital model outperforms physical-layout-only baselines, quantifying how Instagram visibility now factors into urban property prices.

Shintaro Sakai, Haewoon Kwak, Jisun An, Akira Matsui

arXiv preprint arXiv:2510.03905, 2025 · 2025

How did gender stereotypes shift in Japan across the 20th century? We train 100 year-specific word embeddings on Japanese corpora from 1900 to 1999 and trace stereotype strength in the Home, Work, and Politics domains. Work and Politics became more female-stereotyped over time, but Home did too, suggesting women came to be seen as occupying multiple roles rather than swapping them.

Joseph Malone, Rachith Aiyappa, Byunghwee Lee, Haewoon Kwak, Jisun An, Yong-Yeol Ahn

arXiv preprint arXiv:2511.18616, 2025 · 2025

When an LLM predicts someone's beliefs, what matters more: who they are (demographics) or what else they believe (prior stances)? Testing open-weight LLMs on online-debate data under four conditions, we find both signals improve prediction, but their relative value varies sharply across belief domains.

Kunihiro Miyazaki, Taichi Murayama, Takayuki Uchiba, Jisun An, Haewoon Kwak

EPJ Data Science, 2024 · 2024 · 80 cites

What does the public actually think of generative AI? We analyze 3M tweets from 2019 to 2023 and find broad interest across occupations, not just tech. Sentiment is generally positive and tracks exposure, with one exception: illustrators are notably negative, reflecting concerns over training-data ethics.

Preslav Nakov, Jisun An, Haewoon Kwak, Muhammad Arslan Manzoor, Zain Muhammad Mujahid, Husrev Taha Sencar

ACL Findings, 2024 · 2024 · 31 cites

Fact-checking every story is impossible. Instead, this survey reviews how to profile entire news outlets, so any article can be flagged the moment it appears. We argue that factuality and political bias should be modeled jointly rather than separately, and survey the state of the art across text, social context, and beyond.

Fan Huang, Haewoon Kwak, Kunwoo Park, Jisun An

LREC-COLING, 2024 · 2024 · 26 cites

As AI explains its own decisions in natural language, who should grade those explanations? We compare ChatGPT and human judgments on informativeness and clarity across binary, ternary, and 7-point scales, finding ChatGPT aligns well with humans on coarse-grained ratings, with paired comparison and dynamic prompting further improving alignment.

Kunihiro Miyazaki, Takayuki Uchiba, Haewoon Kwak, Jisun An, Kazutoshi Sasahara

Scientific Reports, 2024 · 2024 · 15 cites

Anti-vaccine YouTube comments are often dismissed as background noise, but do they actually push viewers toward hesitancy? We find that highly-liked toxic comments measurably increase fear in later comments, with contagion running in both directions. The result has implications for public health messaging and how moderators handle 'just trolling.'

Arvindh Arun, Saurav Chhatani, Jisun An, Ponnurangam Kumaraguru

WOAH · 2024 · 12 cites

What happened to hate speech on Twitter after Elon Musk's takeover and the relaxation of moderation? On 10 million tweets, we document significant increases in hate targeting LGBTQ+ users and liberals, with cohesive hate communities forming around influential 'bridge' users. Ironically, embracing free-speech principles enabled hate against the very concept of free speech itself.

Rachith Aiyappa, Shruthi Senthilmani, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

arXiv preprint arXiv:2403.00236, 2024 · 2024 · 7 cites

Can LLMs do stance detection on tweets without any fine-tuning? Using FlanT5-XXL with SemEval 2016 and P-Stance, zero-shot prompting matches or beats fine-tuned baselines. We trace how performance varies with prompt phrasing, decoding strategy, and prompt perplexity, and identify a positivity bias that partially explains performance differences across decoding strategies.

Hong Zhang, Quoc-Nam Nguyen, Prasanta Bhattacharya, Wei Gao, Liang Ze Wong, Brandon Siyuan Loh, Joseph J. P. Simons, Jisun An

arXiv preprint arXiv:2310.09848, 2024 · 2024 · 5 cites

Stance detection usually looks at message text alone. We add Moral Foundations Theory features (theoretically-derived dimensions of an individual's moral concerns) on top of standard semantic features, using both traditional ML and LLMs. Encoding moral foundations improves classification and illuminates which moral foundations associate with which online stances.

Fan Huang, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2402.11167, 2024 · 2024 · 5 cites

Detectors of AI-generated text are quietly competing against attacks that paraphrase or swap words. ToBlend goes further: it samples tokens randomly from an ensemble of LLMs, producing text that drops most detectors' performance significantly while human raters confirm quality holds. We also release a fine-tuned Llama-3.1 detector tuned to catch ToBlend output.

Fan Huang, Haewoon Kwak, Jisun An

arXiv preprint arXiv:2402.11167 · 2024 · 5 cites

Can AI-text detectors survive smarter adversaries? We propose ToBlend, which generates text by randomly sampling next tokens from multiple candidate LLMs. The blended output sharply degrades most mainstream detectors while staying fluent to human evaluators, exposing a robustness gap and motivating a fine-tuned Llama-3.1 defender.

Zoher Kachwala, Jisun An, Haewoon Kwak, Filippo Menczer

NAACL Findings, 2024 · 2024 · 4 cites

Comparing knowledge graphs that represent sentence meaning is harder than it sounds: existing metrics are slow and miss semantic nuance. Rematch is a new AMR similarity metric that ranks first on semantic benchmarks (1 to 5 percentage points over state of the art) and runs five times faster.

Sumin Han, Jisun An, Y. Park, Suji Kim, Kitae Jang, Dongman Lee

arXiv preprint arXiv:2408.12890, 2024 · 2024 · 1 cite

Short-term transportation demand prediction guides scheduling, fleet sizing, and transit-network design. Most models use one or two areal features; cities need many (land use, sociodemographics, POI distribution). Our ST-MFGCRN model fuses multiple areal features via sentinel attention, beating state-of-the-art baselines by 7 to 8% on BusDJ and TaxiBJ datasets.

Sumin Han, Jisun An, Dongman Lee

arXiv preprint arXiv:2408.12882, 2024 · 2024

Mid-term road traffic prediction (a few hours out) matters for car-sharing and ride-hailing, but existing models focus on short-term effects between adjacent roads. We embed real-time regional knowledge (POIs, satellite images, LTE access traces) into a road-level attention model via dynamic convolution and bipartite spatial-transform attention, beating baseline approaches.

Fan Huang, Haewoon Kwak, Jisun An

WWW Companion, 2023 · 2023 · 453 cites

Implicit hate speech is hard for both detectors and explainers. We ask whether ChatGPT can produce useful natural-language explanations of why a subtle hateful post is hateful, and compare its output to human annotators through user studies. Promising for routine cases, but with consistent limitations on harder ones.

Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

TrustNLP (Collocated with ACL), 2023 · 2023 · 133 cites

Most ChatGPT evaluations claim impressive numbers, but how trustworthy are they? Through a stance-detection case study, we show that training-set contamination silently inflates ChatGPT's apparent performance. We argue that fair evaluation of closed, continuously-trained models needs new conventions.

Fan Huang, Haewoon Kwak, Jisun An

WWW Companion, 2023 · 2023 · 57 cites

Generating clear explanations of why a post is hateful is hard, especially when hate is implicit. Our Chain of Explanation prompting method feeds the model heuristic words and the targeted group up front, lifting BLEU scores from 44.0 to 62.3 and producing more informative explanations on human evaluation.

Kunihiro Miyazaki, Takayuki Uchiba, Kenji Tanaka, Jisun An, Haewoon Kwak, Kazutoshi Sasahara

AAAI ICWSM, 2023 · 2023 · 19 cites

Fact-checkers can't keep up with COVID-19 misinformation. We study spontaneous debunking by regular Twitter users: which fake tweets attract corrections, and who corrects them. Building an annotated dataset and a classifier, we find most fake tweets are never debunked, spontaneous corrections are slower than other replies, and political topics show a clear partisan asymmetry.

Shaoyi Liaw, Fan Huang, Fabricio Benevenuto, Haewoon Kwak, Jisun An

ICWSM Dataset, 2023 · 2023 · 17 cites

Conspiracy theories are widely propagated on YouTube, yet the data needed to study them at scale is limited. YouNICon is a dataset of videos from channels previously identified as suspicious, supporting both conspiracy-theory detection and topic classification. The dataset enables studying trends in conspiracy theories and how individuals interact with such channels.

Hong Zhang, Haewoon Kwak, Wei Gao, Jisun An

ACM WebSci, 2023 · 2023 · 13 cites

If you know someone's view on masks, can you predict their view on Trump? Across three seemingly unrelated COVID-era topics, we train stance-detection models on Twitter and find that opinions cluster predictably: past stances on one issue strongly forecast future stances on another.

Sumin Han, Youngjun Park, Minji Lee, Jisun An, Dongman Lee

ACM CIKM, 2023 · 2023 · 12 cites

Traffic prediction usually relies on sensor networks that miss the human side of urban movement. We add human-activity data from the National Household Travel Survey to graph-convolutional traffic models, achieving state-of-the-art accuracy with minimal architectural changes.

Kunihiro Miyazaki, Taichi Murayama, Akira Matsui, Masaru Nishikawa, Takayuki Uchiba, Haewoon Kwak, Jisun An

ACM WebSci, 2023 · 2023 · 7 cites

Newly-elected leaders typically enjoy a brief 'honeymoon' of high approval. We test this on social media by analyzing 6.6M tweets across 8 Japanese prime-minister transitions over 15 years. The honeymoon does appear, but the user base flips dramatically with each change, and sentiment patterns vary by topic and PM.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen

Behaviour & Information Technology, 2022 · 2022 · 54 cites

What makes a news post go viral, and on which platform? Analyzing 3M posts from 53 outlets and 50M comments across 5 platforms over 8 months, we trace how sentiment and topic shape four levels of engagement (views, likes, comments, cross-platform sharing) and predict cross-platform diffusion with up to 83% F1.

Claire Seungeun Lee, Juan Merizalde, John D. Colautti, Jisun An, Haewoon Kwak

Frontiers in Sociology, 2022 · 2022 · 49 cites

How does offline political speech translate into online mobilization? We combine the text of Trump's speeches leading up to January 6, 2021 with Twitter data from QAnon hashtags over the same three days, using mixed quantitative and qualitative analysis. The results show measurable links between specific addresses and online participation.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen

ACM Transactions on Social Computing, 2022 · 2022 · 30 cites

How do emotions in news posts ripple into audience reactions? Across 133K posts and 2.8M comments from 8 outlets on 4 platforms over 8 months, we score 9 emotions (Plutchik's wheel) and predict reader emotional engagement before and after a post goes live, reaching over 70% F1 for some emotions and revealing systematic platform differences.

Yun Yu Chong, Haewoon Kwak

Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), 2022 (short) · 2022 · 24 cites

Most toxicity research focuses on Western online communities. We extend trigger detection to an Asian context using Reddit conversations from Singapore, showing both that detection is feasible and that the triggers themselves can differ markedly from Western patterns. A reminder that 'toxic' is culturally specific.

Yelena Mejova, Jisun An, Gianmarco De Francisci Morales, Haewoon Kwak

ACM Transactions on Social Computing, 2022 · 2022 · 12 cites

Can social media reveal who will engage politically on contested issues? On the U.S. gun debate, we use Twitter content and network structure to classify user stance, predict state-level gun policy, and forecast individual offline political action. Network features (especially the relative size of each side) emerge as strong signals alongside content and demographics.

Haewoon Kwak

ACM WebSci, 2022 · 2022 · 2 cites

PlayStation 'trophies' are one of gaming's most-discussed achievement systems, but no large-scale study exists. We build the first complete dataset of PlayStation games and their trophies, then characterize them from both the designer's and the player's perspectives, uncovering patterns in how trophies are designed and which ones players actually complete.

Haewoon Kwak, Jisun An, Kunwoo Park

Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), 2022 (short) · 2022 · 1 cite

Conversation corpora power most interactive AI, but who actually contributed to them? Analyzing a Korean nationwide daily conversation corpus from the National Institute of Korean Language, we characterize the participation of different demographic groups (age and sex) and identify the voices that are over- and under-represented in this widely-used resource.

Kunihiro Miyazaki, Takayuki Uchiba, Haewoon Kwak, Jisun An

IEEE BigData, 2022 (short) · 2022

When Facebook became Meta, Twitter users spontaneously started proposing new acronyms for Big Tech ('MAANG,' 'MANGA,' and many more). We trace how candidates were proposed, spread, and judged, finding that early posters had fewer followers than later spreaders, that rankings shifted slowly, and that most new ideas appeared in replies at the first depth.

Haewoon Kwak, Jisun An, Elise Jing, Yong-Yeol Ahn

PeerJ Computer Science 7:e644, 2021 · 2021 · 49 cites

Framing nudges readers toward conclusions without making explicit arguments. FrameAxis identifies the semantic axes ('microframes') that a document leans on, using word embeddings rather than manual annotation. The method scales to large datasets and exposes framing bias and intensity in texts ranging from restaurant reviews to political news.

Jisun An, Haewoon Kwak, Hanya M Qureshi, Ingmar Weber

JMIR Form Res 2021;5(9):e22313, 2021 · 2021 · 42 cites

Health campaigns usually broadcast one message to everyone. We propose a five-stage framework for precision public health: define the goal and metrics, segment the audience, tailor and pilot a message per segment, run the campaign, and evaluate it, using social-media ad-targeting tools. Two case studies demonstrate the framework in practice.

Jisun An, Haewoon Kwak, Claire Seungeun Lee, Bogang Jun, Yong-Yeol Ahn

Findings of the Association for Computational Linguistics: EMNLP 2021 · 2021 · 41 cites

Can we tell, before COVID-19 began, who would later post anti-Asian slurs on Twitter? Comparing users who did and didn't post such slurs after the pandemic's onset, we identify pre-pandemic features that predict later hate behavior, pointing to the role of news media and polarized communication networks in priming hostility.

Andreas Jungherr, Oliver Posegga, Jisun An

Social Science Computer Review, March 2021 · 2021 · 35 cites

If digital media empower populist movements, can we see it in distinct usage patterns? Comparing /r/The_Donald and /r/hillaryclinton during the 2016 election, we find Trump supporters expressed more anti-elitism and exclusion (but not 'people-centrism'), posted more, used distinctive vocabulary, and built tighter community ties.

Kunwoo Park, Haewoon Kwak, Jisun An, Sanjay Chawla

Proceedings of the 15th International AAAI Conference on Web and Social Media (ICWSM), 2021 · 2021 · 23 cites

When media outlets share an article on Twitter, the edited tweet often departs sharply from the original headline. Does that editing pay off? Using a parallel corpus of articles and tweets from 8 outlets, we apply propensity-score matching to estimate the causal effect of editing styles on engagement, surfacing both common and outlet-specific patterns.

Joni Salminen, Joao M. Santos, Haewoon Kwak, Jisun An, Soon-gyo Jung, Bernard J. Jansen

International Journal of Human-Computer Studies, 2020 · 2020 · 101 cites

Personas are widely used but hard to evaluate because no validated measurement instrument existed. We develop the Persona Perception Scale (PPS), drawing on social psychology, persona studies, and HCI, and validate it with 412 respondents. The scale lets researchers measure how individuals perceive personas (credibility, willingness to use) and improve persona adoption.

Ramy Baly, Georgi Karadzhov, Jisun An, Haewoon Kwak, Yoan Dinkov, Ahmed Ali, James Glass, Preslav Nakov

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020 · 2020 · 72 cites

Fact-checking every claim is impossible, so a complementary approach is to profile entire news outlets and flag everything they publish. We argue that bias and factuality should be modeled jointly using both the text the outlet produces and the social context it lives in, and we review approaches that combine the two.

Hind Almerekhi, Haewoon Kwak, Bernard Jim Jansen, Joni Salminen (short)

Proceedings of The Web Conference (WWW), 2020 · 2020 · 68 cites

Toxic replies often have an innocent-looking trigger somewhere up the thread. We define a trigger as a non-toxic comment that provokes a toxic reply, then train a neural model on shifts in sentiment, topic, and conversational context. The model identifies triggers with a 0.87 ROC-AUC, suggesting a way to head off escalation before it happens.

Haewoon Kwak, Jisun An, Yong-Yeol Ahn

Proceedings of the 12th ACM Conference on Web Science (WebSci), 2020 · 2020 · 66 cites

Framing is hard to measure at scale because frames usually need hand-coding. We train a state-of-the-art frame classifier and apply it to 1.5 million New York Times articles spanning 2000 to 2017. Short-term frame fluctuations track major events; long-term, the 'cultural identity' frame rises sharply, and a case study on mass shootings reveals three distinct patterns.

Soumen Ganguly, Juhi Kulshrestha, Jisun An, Haewoon Kwak

Proceedings of the 14th International AAAI Conference on Web and Social Media (ICWSM), 2020 · 2020 · 32 cites

Many political-bias datasets rest on three common assumptions: labelers' politics don't bias their labels, articles follow their outlet's leaning, and an outlet's leaning is stable across topics. We test all three on a manually annotated dataset and find each can fail even at small scale, a warning for how bias datasets are constructed and trusted.

Kunwoo Park, Haewoon Kwak, Hyunho Song, Meeyoung Cha

Proceedings of the 14th International AAAI Conference on Web and Social Media (ICWSM), 2020 · 2020 · 21 cites

Does showing your offline credentials affect how people respond to you online? On two Reddit science communities that adopted 'degree flairs,' we find the halo effect varies by community culture: in the strict peer-reviewed subreddit, any degree boosts votes and discussion; in the casual one, only PhD or MD flairs do.

Samuel S Guimarães, Julio CS Reis, Lucas Lima, Filipe N Ribeiro, Marisa Vasconcelos, Jisun An, Haewoon Kwak, Fabrício Benevenuto

IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020 · 2020 · 9 cites

We propose a graph-based semi-supervised method to measure the political bias of Facebook news pages across many countries, separating alternative media, mainstream media, and public-figure pages. We validate it on a U.S. dataset and apply it to Brazil, where right-wing pages dominate generally but alternative news media break the pattern.

George Berry, Antonio Sirianni, Ingmar Weber, Jisun An, Michael Macy (preprint)

arXiv preprint arXiv:2001.11171, 2020 · 2020 · 5 cites

Social-network studies often estimate homophily from predicted attributes (for example, gender or ethnicity inferred from names), but high prediction accuracy doesn't guarantee accurate homophily estimates. We show estimation works only when dyad-level residuals cancel out, and propose an 'ego-alter' approach that beats standard node and dyad classifiers.

Andreas Jungherr, Oliver Posegga, Jisun An

The International Journal of Press/Politics, 24(4), 2019 · 2019 · 233 cites

Today's news ecosystems mix legacy outlets, platforms, influencers, and amateurs in ways that resist easy comparison. We propose 'discursive power' as a measurable concept: the ability to introduce, amplify, and sustain topics, frames, and speakers in interconnected communication spaces, enabling empirical comparative work on hybrid media systems.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen

Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM), 2019 · 2019 · 157 cites

Different platforms surface different topics: how does that shape engagement? Across 3.16M posts from 53 news organizations on Facebook, Instagram, Twitter, YouTube, and Reddit over 8 months, we map topic distribution by platform and topic-driven engagement at four levels, finding that cross-platform sharing of any given article is predictable with around 80% precision.

Jisun An, Haewoon Kwak, Oliver Posegga, Andreas Jungherr

Proceedings of the 13th International AAAI Conference on Web and Social Media (ICWSM), 2019 · 2019 · 79 cites

Online political talk happens in two very different settings: groups of like-minded users, and mixed spaces where opposing views meet. We use Reddit to develop an analytical template comparing interaction and linguistic patterns within and between politically homogeneous and cross-cutting communication spaces, finding markedly different behaviors in each.

Hind Almerekhi, Haewoon Kwak, Bernard Jim Jansen, Joni Salminen (poster)

Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT), 2019 · 2019 · 73 cites

What sparks a toxic exchange in online discussions? We define toxicity 'triggers' formally, then build an LSTM model combining textual features with measures of topic and sentiment shift across interactions. Trained on diverse Reddit communities, the model detects triggers with 82.5% average accuracy.

Jisun An, Haewoon Kwak

Proceedings of Social Informatics (SocInfo), 2019 · 2019 · 34 cites

What images do top global brands actually share on Instagram and Facebook, and who shows up in them? Analyzing 85,957 advertising images from 73 international brands, we measure gender and racial diversity at scale, working toward an automated 'watchdog' for diversity in online advertisements.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen

Proceedings of Social Informatics (SocInfo), 2019 · 2019 · 19 cites

High engagement on one platform does not guarantee high engagement on another, even when news outlets use similar cross-platform posts. For some content, however, cross-sharing posts on one platform increases overall audience engagement on another.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen (Extended Abstracts)

Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI), 2019 · 2019 · 16 cites

Creating 'engaging' content across social-media platforms is a daily grind for editors at serious news organizations, with no clear playbook. Through a focus-group study at a major news org, we find engagement work is effort- and time-intensive across three platforms, and we sketch a system design that would help editors select and adapt engaging passages from articles.

Arkaitz Zubiaga, Bahareh Heravi, Jisun An, Haewoon Kwak (Guest editorial)

Online Information Review, 2019 · 2019 · 12 cites

Social media now captures and announces breaking news faster than traditional outlets, reshaping the journalistic landscape. We survey how social-media mining is being adopted by professional journalists for newsgathering (drawing on citizen reporting and updates) and as an additional channel for news dissemination, opening both opportunities and challenges for newsrooms.

Yifan Zhang, Giovanni Da San Martino, Alberto Barrón-Cedeño, Salvatore Romeo, Jisun An, Haewoon Kwak, Todor Staykovski, Israa Jaradat, Georgi Karadzhov, Ramy Baly, Kareem Darwish, James Glass, Preslav Nakov (demo)

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019 · 2019 · 10 cites

Behind a news story sits a much larger context: the outlet's general factuality, partisanship, propaganda style, ideology, framing, and stance on key topics. Tanbih is a news aggregator that groups stories by event and generates these media profiles automatically, plus per-article propaganda detection and stance with respect to specific controversial claims.

Kholoud Khalil Aldous, Jisun An, Bernard J. Jansen

Proceedings of Social Informatics (SocInfo), 2019 · 2019 · 8 cites

We analyze ten common stylistic features indicative of user engagement in news posts on multiple social media platforms, and compare how posts from different news sources use them. Usage of these features differs across platforms (Facebook, Instagram, Twitter, and YouTube).

Savvas Zannettou, Barry Bradlyn, Emiliano De Cristofaro, Haewoon Kwak, Michael Sirivianos, Gianluca Stringhini, Jeremy Blackburn

Companion Proceedings of The Web Conference (WWW), 2018 · 2018 · 389 cites

Gab launched in 2016 as a free-speech alternative to Twitter, attracting users banned from other platforms. We analyze 22M posts from 336K users (Aug 2016 to Jan 2018) and find Gab is mostly used for news and world events, attracts alt-right users, conspiracy theorists, and trolls, with hate-speech prevalence much higher than Twitter but lower than 4chan's /pol/.

Joni Salminen, Hind Almerekhi, Milica Milenković, Soon-gyo Jung, Jisun An, Haewoon Kwak, Bernard J. Jansen

Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018 · 2018 · 208 cites

Automatic hate-speech detection is hard partly because 'hate' covers many distinct types and targets. We hand-labeled 5,143 hateful YouTube and Facebook comments, built a granular taxonomy capturing both the kind and target of hate, and trained classifiers to detect and categorize new comments. A linear SVM with TF-IDF reaches 0.79 average F1.

Jisun An, Haewoon Kwak, Soon-gyo Jung, Joni Salminen, M. Ahmad, Bernard J. Jansen

ACM Transactions on the Web, 12(4), 2018 · 2018 · 176 cites

Marketing personas are hand-crafted, slow, and quickly outdated. We automate them from social-media data through behavioral segmentation, demographic grouping, and auto-generated names and photos. Validated on 30M viewer interactions across 198 countries and 4,200 videos, the personas predict content preference and remain stable over time.

Jisun An, Haewoon Kwak, Yong-Yeol Ahn

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018 · 2018 · 104 cites

Word meanings shift across communities and contexts, but most NLP tools collapse that nuance into a single sentiment dimension. SemAxis is a lightweight framework that characterizes words along arbitrary semantic axes in vector space, capturing community-specific meaning. On sentiment specifically, SemAxis beats state-of-the-art methods for building domain-specific lexicons.

Soon-gyo Jung, Jisun An, Haewoon Kwak, Joni Salminen, Bernard Jim Jansen

Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018 (short) · 2018 · 92 cites

When computational social scientists infer demographics from photos, which face-recognition tool can they trust? Testing Face++, IBM Bluemix, AWS Rekognition, and Microsoft Azure across multiple datasets, we find gender inference works (above 90%, except IBM), race inference is offered by only one tool, and age is consistently inaccurate. We recommend triangulation and manual checks.

Joni Salminen, Lene Nielsen, Soon-Gyo Jung, Jisun An, Haewoon Kwak, Bernard J. Jansen

Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI), 2018 · 2018 · 91 cites

Does adding photos to a marketing persona make it more informative? Eye-tracking and interviews compare a single headshot, a headshot plus contextual scenes of the persona, and a headshot plus images of different people. Contextual images help; mixed people confuse readers, and viewers' own preconceptions shape what they see.

Haewoon Kwak, Jisun An, Joni Salminen, Soon-Gyo Jung, Bernard J. Jansen

Proceedings of the 2018 World Wide Web Conference (WWW), 2018 · 2018 · 40 cites

Do people care about the same countries their national news covers? Across 193 countries over 14 months, we find that media and public attention align regionally but diverge by topic. In 63% of countries, the two attend to entirely different countries, suggesting audiences seek information their local outlets are not providing.

Joni Salminen, Bernard J. Jansen, Jisun An, Soon-Gyo Jung, Lene Nielsen, Haewoon Kwak

Proceedings of the 2018 Conference on Human Information Interaction & Retrieval (CHIIR), 2018 · 2018 · 30 cites

What confuses users of persona profiles? In a 29-participant eye-tracking and think-aloud study with three persona layouts, we measure where users fixate, what they say, and how the two relate. Inconsistencies between persona elements (especially between photos and quoted social-media voices) drive most confusion. We propose a 'confusion ratio' to flag problematic profile elements.

Peter Mawhorter, Sercan Şengün, Haewoon Kwak, D. Fox Harrell

IEEE Transactions on Games, 10(2), 2018 · 2018 · 3 cites

Avatars are a quiet form of self-expression, but the values built into avatar systems are rarely studied. Using deep learning and clustering on roughly 100,000 avatars from a major creation platform, we surface regional trends: realistic (US, mirroring demographics), idealistic (Japan, reflecting media tropes), and creative (community-spread), and link each to user activity.

Soon-Gyo Jung, Jisun An, Haewoon Kwak, Moeed Ahmad, Lene Nielsen, Bernard J. Jansen (Extended Abstract)

Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI), 2017 · 2017 · 125 cites

We demonstrate a real-time persona-generation pipeline on a YouTube channel of a global media corporation: 30M+ interactions, 181 countries, 4,200+ digital products. The system identifies distinct, impactful user segments and decorates each with auto-generated names, photos, and personal attributes, producing believable personas from real-time online data.

Kunwoo Park, Meeyoung Cha, Haewoon Kwak, Kuan-Ta Chen

Proceedings of the 26th International Conference on World Wide Web (WWW) Companion, 2017 · 2017 · 57 cites

Why do players stay or leave a game? Analyzing in-game logs of 51,104 players in an online multiplayer game across their virtual 'life phases,' we find what predicts retention shifts with progression: achievement features matter most early and mid-game, but at the highest level, social ties become the strongest predictor of who keeps playing.

J. Salminen, S. Sengün, H. Kwak, B. Jansen, J. An, S. Jung, S. Vieweg, D. F. Harrell

Proceedings of the 5th International Conference on Future Internet of Things and Cloud Workshops, 2017 · 2017 · 46 cites

We adapt automatic persona generation for Middle Eastern social-media users, blending computational analysis of millions of YouTube interactions with interview data to enrich personas. The result: culturally adapted personas that capture how different demographic groups in the region actually engage with content, and a methodology that respects cultural context.

Jisun An, Haewoon Kwak, Bernard J. Jansen (short)

Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017 · 2017 · 46 cites

Content creators distributing on online platforms need to understand their audiences. We generate personas from online user data using non-negative matrix factorization to identify user segments, then add names and photos for personality. The result is data-driven personas that represent real audience groups, replacing manually gathered data.

Julio Reis, Haewoon Kwak, Jisun An, Johnnatan Messias, Fabrício Benevenuto

Proceedings of the 28th ACM Conference on Hypertext and Social Media (HT), 2017 · 2017 · 44 cites

Who shares news on Twitter, and what does that mean for what most readers see? In the first detailed study of news sharers' demographics, content, and audience, we find that male and white users dominate news sharing, biasing the visible mix toward those groups' interests, with implications for personalized news design.

Lene Nielsen, Soon-Gyo Jung, Jisun An, Joni Salminen, Haewoon Kwak, Bernard J. Jansen

Proceedings of the 29th Australian Conference on Computer-Human Interaction (OZCHI), 2017 · 2017 · 30 cites

Personas are supposed to align cross-team user understanding, but do they? Through 16 qualitative interviews with media producers at Al Jazeera English, we find understandings are highly varied, built from a mix of personal experience, self-projection, assumptions, and company data, and only sometimes match the data-driven personas.

D. Fox Harrell, Sarah Vieweg, Haewoon Kwak, Chong-U Lim, Sercan Sengun, Ali Jahanian, Pablo Ortiz

Proceedings of the 2017 ACM SIGCHI Conference on Creativity and Cognition (C&C), 2017 · 2017 · 13 cites

How do MENA-region users (whose cultures rarely shape platform design) creatively adapt social media to express identity? Combining computational analysis of 42,000+ Qatari social profiles with 255 qualitative profile analyses and interviews, we identify five distinct values driving virtual-identity creativity, and propose guidelines for systems that respect those values.

Jisun An, Haewoon Kwak (short)

Proceedings of the 11th International AAAI Conference on Web and Social Media (ICWSM), 2017 · 2017 · 12 cites

Why do some news topics get sustained media attention while others fade? Most prior work was limited to a few countries or topics for lack of global longitudinal data. We use large-scale data from 196 countries to provide empirical analyses of how media attention forms, peaks, and dissipates over time.

Ah Reum Kang, Jeremy Blackburn, Haewoon Kwak, Huy Kang Kim

Proceedings of the 26th International Conference on World Wide Web (WWW) Companion, 2017 · 2017 · 11 cites

What do people do when they know their work will be erased? Using 270 million records from the final days of the ArcheAge MMO closed beta (where all character data was about to be deleted), we find no broad antisocial collapse, but a sharp drop in quest completion, leveling, and ability changes. Most players stopped planting their proverbial apple trees.

Jisun An, Haewoon Kwak

Proceedings of Social Informatics (SocInfo), 2017 · 2017 · 9 cites

How differently do men and women, young and old, consume news? Examining 103,133 most-popular articles for different demographic groups in Daum News (Korea's second-largest portal) over 2015, we measure differences at four levels: items, sections, topics, and subtopics. Demographic differences appear most sharply at the subtopic level, not at the broader section or topic level.

Jisun An, Haewoon Kwak

Proceedings of the ICWSM Workshop on NEws and publiC Opinion (NECO), 2017 · 2017 · 7 cites

The Press Freedom Index, published annually by Reporters Without Borders, is invaluable but expensive and slow because it's survey-based. We propose an alternative: measure press freedom from the diversity of attention an outlet pays, computed automatically from large-scale news data. The result tracks the official PFI while updating in near-real time.

Jisun An, Hassan Aldarbesti, Haewoon Kwak

Proceedings of Social Informatics (SocInfo), 2017 (short) · 2017 · 4 cites

Are news outlets across countries converging on the same stories or going their separate ways? Using daily data on the 100 most over-reported topics in each country over 7 months, we find two patterns coexisting: globalization (shared attention across regions) and domestication (locally distinct coverage), shaped by geography, history, time, and topic.

Haewoon Kwak, Jisun An

Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017 · 2017 · 3 cites

Which countries pay attention to each other in their news coverage, and which deliberately ignore? We build a multiplex Media Attention and Disregard Network across 129 countries over 212 days. The network is skewed, hierarchical, and asymmetric, but suggests a 'global village' overall, with persistent attention blocks around the Middle East/North Africa and Russia's neighbors.

Soo-Yong Shin, Dong-Woo Seo, Jisun An, Haewoon Kwak, Sung-Han Kim, Jin Gwack, Min-Woo Jo

Scientific Reports 6, Article number 32920 (2016) · 2016 · 151 cites

Can web searches and social-media chatter predict an emerging-disease outbreak? During the 2015 MERS outbreak in Korea, daily Google searches and tweets for four Korean keywords correlated strongly (above 0.7) with new and quarantined cases over the next three days, suggesting that simple digital signals can complement traditional epidemiological surveillance.

Jisun An, Ingmar Weber (short)

Proceedings of the 10th International Conference on Web and Social Media (ICWSM), 2016 · 2016 · 77 cites

Demographics matter for online behavior, but most social-media studies skip them for lack of data. We use face-analysis software to infer gender, age, and race from profile images of 350K New York Twitter users, then track which hashtags each demographic group uses over a year. Top hashtags overlap broadly, but each group also has its own distinctive set.

Jisun An, Haewoon Kwak, Yelena Mejova, Sonia Alonso Saenz De Oger, Braulio Gomez Fortes

Proceedings of the 10th International Conference on Web and Social Media (ICWSM), 2016 · 2016 · 46 cites

How did the world respond to the Charlie Hebdo shootings on Twitter? We test whether cultural background alone predicts stance on free speech (Huntington's 'Clash of Civilizations') or whether local social context matters too. Both: 'civilization' explains broad patterns, but non-Arabs embedded in mixed networks used #JeSuisAhmed five times more often.

Haewoon Kwak, Jisun An

Proceedings of the 10th International Conference on Web and Social Media (ICWSM), 2016 (short) · 2016 · 38 cites

GDELT and Event Registry are the two big repositories researchers use to study global news. We compare them across scale, news sources, and news geography. They differ sharply in scale and sources but agree closely on news geography — a useful warning and reassurance for anyone choosing between them.

Haewoon Kwak, Jisun An

ICWSM Workshop on NEws and publiC Opinion (NECO), 2016 · 2016 · 31 cites

What kinds of objects appear in news photos, and do they tell the same story as the article text? Analyzing over 2 million news photos from January 2016 using deep-learning vision APIs, we measure object frequency, photo sentiment, alignment with text tone, gender representation, and visual portrayal of political candidates. The first study of its scale.

Haewoon Kwak, Jeremy Blackburn, Seungyeop Han

Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI), 2015 · 2015 · 485 cites

Toxic behavior in team-based competitive games is widely felt but poorly understood empirically. Using over 10 million player reports on 1.46 million toxic players in League of Legends (with crowdsourced verdicts attached), we test hypotheses drawn from theories of toxic behavior at unprecedented scale, providing a foundation for systems to detect, prevent, and counter such acts.

Julio Reis, Fabrício Benevenuto, Pedro Olmo, Raquel Prates, Haewoon Kwak, Jisun An

Proceedings of the 9th International Conference on Web and Social Media (ICWSM), 2015 · 2015 · 252 cites

News headlines are the first impression in the click economy. Analyzing 69,907 headlines from four major media corporations over 8 months in 2014, we extract sentiment features and find a strong link between a headline's sentiment polarity and both the popularity of the news and the dynamics of its comments. Sentiment shapes how news travels.

Jisun An, Ingmar Weber

EPJ Data Science, 4, Article number 22, 2015 · 2015 · 29 cites

Social media 'now-casting' (predicting flu activity, unemployment) usually counts every tweet equally, but social-media users don't equally represent the offline population. We show that filtering by user characteristics (profile completeness, group membership) substantially improves now-casting accuracy: even when most users are excluded, the right subset can outperform the whole.

Sofiane Abbar, Jisun An, Haewoon Kwak, Yacine Messaoui, Javier Borge-Holthoefer

Computation+Journalism (C+J) Symposium, 2015 · 2015

News-geography research usually relies on what outlets publish, but what about what audiences actually read and react to? We pair 20,000 Aljazeera articles with 2M+ comments by 90,000 readers (countries inferred from IP) to contrast the producer's lens (coverage) with the consumer's lens (attention), exposing where the two diverge.

Jeremy Blackburn, Haewoon Kwak

Proceedings of the 23rd international conference on World wide web (WWW), 2014 · 2014 · 306 cites

League of Legends crowdsources toxicity judgements through the 'Tribunal,' but human review is slow and expensive. We train classifiers on 10M+ player reports covering 1.46M toxic players and the corresponding crowdsourced verdicts. The model captures strong-majority cases well and ports across regions, suggesting a practical way to reduce moderation cost.

Martin Pielot, Rodrigo de Oliveira, Haewoon Kwak, Nuria Oliver

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2014 · 2014 · 271 cites

Instant messaging silently signals 'last seen,' creating social pressure to reply fast. We show those signals are also weak predictors of actual attentiveness. Tracking 24 users over 2 weeks, we train a model on phone signals (notifications, screen activity, proximity, ringer mode) that predicts whether a user will view a message within minutes at 70% accuracy.

Haewoon Kwak, Jeremy Blackburn

SocInfo Workshop on Exploration on Games and Gamers (EGG), 2014 · 2014 · 162 cites

What does toxic player chat actually look like, linguistically? Using crowdsourced data on 590,000+ reported toxic players in League of Legends, we trace the language patterns that distinguish toxic players from typical ones, and find empirical support for how a player gradually slides from typical into toxic behavior. The findings enable earlier detection and warning.

Jisun An, Daniele Quercia, Jon Crowcroft

Proceedings of the Second ACM Conference on Online Social Networks (COSN), 2014 · 2014 · 143 cites

Even if Facebook users see politically diverse articles, do they only share ones that match their views? We test for 'partisan sharing' across individuals, time, and topic importance, and find strong evidence it exists. A controlled experiment with 70+ BBC viewers shows partisan sharing distorts perception of reality, but is also associated with more political knowledge.

Jisun An, Daniele Quercia, Jon Crowcroft

Proceedings of the 23rd international conference on World wide web (WWW), 2014 · 2014 · 140 cites

Most Kickstarter projects fail because creators can't reach enough investors. We propose ways to recommend Twitter accounts likely to pledge to a given project, translating hypothesis-driven analyses of pledging behavior into recommendation strategies. The best strategy predicts a list of plausible investors with 84% accuracy.

Laszlo Gyarmati, Haewoon Kwak, Pablo Rodriguez

KDD Workshop on Large-Scale Sports Analytics, 2014 · 2014 · 134 cites

Is there really such a thing as a unique team style in modern soccer? We introduce 'flow motifs' — statistically significant patterns in pass sequences — to compare teams' play. Most teams converge on similar structures, but FC Barcelona's tiki-taka turns out to be a genuinely distinctive, tightly constructed pattern rather than a flurry of random passes.

Jisun An, Daniele Quercia, Meeyoung Cha, Krishna Gummadi, Jon Crowcroft

EPJ Data Science volume 3, Article number 12, 2014 · 2014 · 114 cites

Why do people share certain political articles and not others? We propose a Political News Sharing model integrating gratification, selective exposure, socialization, and trust/intimacy. On Twitter data, sharing is driven by what one agrees with (selective exposure) but also by source credibility and informativeness, and users sometimes share opposing articles when those match their followers' leanings.

Haewoon Kwak, Jisun An

Proceedings of Social Informatics, 2014 · 2014 · 83 cites

What determines whether the world's news media notice a disaster? Using the GDELT dataset that monitors news in over 100 languages, we fit a hierarchical regression and find population, political stability, and damage all matter as expected, but also expose strong regionalism in news geography that's only visible at truly global scale.

Ilias Leontiadis, Antonio Lima, Haewoon Kwak, Rade Stanojevic, David Wetherall, Konstantina Papagiannaki

Proceedings of the 10th ACM International Conference on emerging Networking Experiments and Technologies (CoNEXT), 2014 · 2014 · 75 cites

Cellular networks already log subscribers' coarse location continuously. We ask whether such sparse, irregular data is enough to reconstruct each device's actual daily path. If so, the same approach extends from smartphones to the millions of future connected 'things,' enabling continuous mobility tracking without GPS or dedicated apps.

Haewoon Kwak, Jisun An (extension of SocInfo'14)

Computation+Journalism (C+J) Symposium, 2014 · 2014 · 33 cites

What determines whether a disaster gets global news coverage? Using the GDELT dataset that monitors news in over 100 languages worldwide, we fit a hierarchical mixed-effect regression model and find that population, political stability, and damage all matter (as prior research suggested), but also expose strong regionalism that shows up only when the dataset is genuinely global.

Haewoon Kwak, Jonggun Lee (poster)

Proceedings of the 23rd International Conference on World Wide Web Companion, 2014 · 2014 · 6 cites

Can Twitter measure a paper's scientific impact? We track how 75,677 research papers from the top 100 journals get shared on Twitter (403,165 tweets by 142,743 users over 135 days). Twitter's reach for scholarly content is real, but attention is highly skewed toward a small subset of top journals, limiting its value as a fair impact signal.

Jisun An, Daniele Quercia, Jon Crowcroft (poster)

Proceedings of the 22nd International Conference on World Wide Web (WWW) Companion, 2013 · 2013 · 86 cites

Selective exposure has been theoretically promising but hard to test empirically. Looking at news articles shared on Facebook, we find concrete evidence that users predominantly share like-minded articles and avoid conflicting ones, with partisans especially so. Building tools to counter this would require identifying partisan users, but the subgroup is demographically indistinguishable from the average user.

Bo Xu, Yun Huang, Haewoon Kwak, Noshir S. Contractor

Proceedings of the 16th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 2013 · 2013 · 69 cites

What makes a Twitter user stop following another? Using actor-oriented modeling (SIENA) on tightly-knit user groups, we find relational properties dominate: mutual following and shared followees reduce unfollow likelihood, and unfollows tend to be reciprocal (being unfollowed prompts unfollowing back). Surprisingly, homophily of interests and informativeness of interactions show no measurable effect.

Fabio Pezzoni, Jisun An, Andrea Passarella, Jon Crowcroft, Marco Conti

Proceedings of the 5th International Conference on Social Informatics (SocInfo), 2013 · 2013 · 50 cites

What makes a tweet get retweeted? Combining network structure with user behavior, we find that a message's visibility in a user's incoming flow strongly predicts whether they pass it on, but latent traits of the original poster also matter. We turn these findings into a propagation model whose simulated cascades match the empirical ones.

Yoonsung Hong, Haewoon Kwak, Youngmin Baek, Sue Moon

WWW Workshop on Multidisciplinary Approaches to Big Social Data Analysis, 2013 · 2013 · 29 cites

Sentiment analysis needs language-specific lexicons, but non-English languages often lack them. Tower of Babel is a language-independent crowdsourcing game that builds sentiment lexicons through play. A 135-participant study finds the game produces accurate classifications, faster than manual annotation and rated as more enjoyable.

Jisun An, Daniele Quercia, Jon Crowcroft

Proceedings of the 5th Annual ACM Web Science Conference (WebSci), 2013 · 2013 · 21 cites

Tools that expose people to diverse opinions are pitched as a fact-checking aid, but does anyone want them? Through surveys of 60 people and in-depth interviews of 10 in the UK and South Korea, we find people don't equally value opinion diversity, and even the same person benefits only sometimes, depending on prior convictions, emotional state, and social context.

Jisun An, Daniele Quercia, Meeyoung Cha, Krishna Gummadi, Jon Crowcroft

Proceedings of the 5th Annual ACM Web Science Conference (WebSci), 2013 · 2013 · 7 cites

Twitter exposes who follows which news outlet and how outlets interact through their audiences. Building a map of 77 media outlets from Twitter subscription overlap, we can answer questions previously locked in media databases: how much do NYT and WSJ readers overlap? Are outlets competing or complementary? Do friends subscribe to similar outlets?

Jisun An, Meeyoung Cha, Krishna Gummadi, Jon Crowcroft, Daniele Quercia

ICWSM Workshop on the Potential of Social Media Tools and Data for Journalists, 2012 · 2012 · 109 cites

Tracking political bias in news outlets matters for media literacy. We map media sources onto a left-right political spectrum using Twitter co-subscription (who follows which outlet). Analyzing 7M follow links shows the political split naturally emerges from direct subscription data, and we demonstrate a real-time application visualizing the ideological landscape.

Haewoon Kwak, Sue Moon, Wonjae Lee (4 page poster)

Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), 2012 · 2012 · 59 cites

Why do people unfollow on Twitter? We fit a logistic regression with structural and behavioral features (ego, dyad, and triad properties) on large-scale data, finding 5 structural and 7 actional variables that significantly predict unfollows. One striking pattern: people are less likely to unfollow accounts that have acknowledged them, even in lopsided communications.

Jisun An, Meeyoung Cha, Krishna Gummadi, Jon Crowcroft

Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011 · 2011 · 246 cites

Who you follow on Twitter shapes which media you see, but indirect exposure (friends who follow outlets, retweets from outside your circle) matters too. Using public follow data, we show that indirect exposure expands the political diversity of news a typical user encounters by 60 to 98%, a finding not visible to traditional media research.

Haewoon Kwak, Hyunwoo Chun, Sue Moon

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2011 · 2011 · 232 cites

How often do Twitter users actually unfollow each other? Tracking 1.2M Korean-speaking users daily for 51 days, we find unfollowing is common. Quantitative analysis identifies key factors (reciprocity, relationship duration, followees' informativeness, overlap of relationships); 22 interviews reveal users unfollow those who tweet too often, about boring topics, or about mundane life details.

Haewoon Kwak, Sue Moon, Young-Ho Eom, Yoonchan Choi, Hawoong Jeong

Journal of the Korean Physical Society, Vol. 59, No. 5, November 2011 · 2011 · 21 cites

Community-detection algorithms tend to return different communities each time you change the input node ordering. We propose two metrics (pairwise membership probability, consistency) and use the first as link weights to drive a procedure that converges to consistent communities within six cycles or fewer, without sacrificing modularity.

Haewoon Kwak, Changhyun Lee, Hosung Park, Sue Moon

Proceedings of the 19th international conference on World wide web (WWW), 2010 · 2010 · 10,578 cites

Twitter exploded from a nascent microblogging service to over 41M users in just a few years. We crawl the full Twitter graph (1.47B follow relationships, 4,262 trending topics, 106M tweets) and characterize its topology and information dynamics, asking whether Twitter behaves more like a social network or a news medium. The finding: more news than network.

Changhyun Lee, Haewoon Kwak, Hosung Park, Sue Moon (poster)

Proceedings of the 19th international conference on World wide web (WWW), 2010 · 2010 · 220 cites

Many methods identify influential Twitter users from follower or retweet counts, but ignore when information actually gets adopted. We propose a method that incorporates both link structure and the temporal order of adoption, surfacing a distinct class of influentials that previous methods miss.

Sue Moon, Jinyoung You, Haewoon Kwak, Daniel Kim, and Hawoong Jeong (invited paper)

Proceedings of the Second International Conference on COMmunication Systems and NETworks (COMSNETS), 2010 · 2010 · 9 cites

Community-detection algorithms often disagree on which nodes belong where. Building on an iterative-reinforcement approach across 12 real networks, we find about 40% of nodes are grouped inconsistently in large social graphs (Orkut, Cyworld). Using participation-coefficient analysis, we classify nodes into seven structural roles and compare two AS networks built from different sources.

Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, Sue Moon

Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (IMC), 2007 · 2007 · 2,234 cites

YouTube reshaped how people make and watch video, and traditional video-on-demand metrics don't apply. We analyze the world's largest user-generated content video system, characterizing the popularity distributions, lifetimes, and request patterns of user-uploaded videos, with implications for caching, monetization, and platform design.

Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon

IEEE/ACM Transactions on Networking, Vol. 17, Issue 5, 2009 · 2009 · 666 cites

User-generated video systems differ fundamentally from traditional video-on-demand. Analyzing YouTube and Korea's Daum Videos, we characterize UGC popularity distributions, the lifetime of a video's popularity, and the relationship between requests and content age. We also surface the impact of content aliasing and illegal uploads on ranking accuracy.

Xavier Amatriain, Neal Lathia, Josep M. Pujol, Haewoon Kwak, Nuria Oliver

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR), 2009 · 2009 · 233 cites

Standard collaborative filtering suffers from data sparsity, cold start, and noise. We propose computing recommendations from a separate set of expert reviews (instead of the user-rating graph itself), weighted by similarity to each user. Validated on Netflix data and a user study where participants preferred our recommendations, the method preserves accuracy while avoiding CF's weaknesses.

Haewoon Kwak, Yoonchan Choi, Young-Ho Eom, Hawoong Jeong, Sue Moon

Proceedings of the 9th ACM SIGCOMM conference on Internet measurement (IMC), 2009 · 2009 · 90 cites

Three popular community-detection algorithms (CNM, Wakita, Louvain) all return different communities every time you change input node ordering. We propose pairwise membership probability as a consistency metric, then use it as link weights in an iterative procedure that converges to consistent communities within six cycles without sacrificing modularity.

Haewoon Kwak, Hwa-Yong Shin, Jong-Il Yoon, Sue Moon (poster)

Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM), 2009 · 2009 · 10 cites

Online services group users with shared interests, but their group-recommendation signals stay siloed inside each service. We propose connecting users across multiple web services through their user-labeled tags, which carry less privacy risk and stay fresher than profile data. A vector-space model with intra-personomy normalization finds similar users effectively across six services.

Hyunwoo Chun, Haewoon Kwak, Young-Ho Eom, Yong-Yeol Ahn, Sue Moon, and Hawoong Jeong

Proceedings of the 8th ACM SIGCOMM conference on Internet measurement (IMC), 2008 · 2008 · 278 cites

Is a Cyworld 'friend' an active relationship or just a declaration? We compare friend-list networks against activity networks formed by actual guestbook comments. The two have similar topology, but only the activity network (with reciprocated comments as edges) behaves like a true social network: more assortative, with fewer artificially dense cores.

Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Sue Moon, Hawoong Jeong

Proceedings of the 16th international conference on World Wide Web (WWW), 2007 · 2007 · 1,428 cites

Do online relationships behave like real-life ones? We compare three online social networks of 10M+ users each (Cyworld, MySpace, orkut), analyzing degree distribution, clustering, and degree correlation. Cyworld's distribution shifts toward multi-scaling over time; the three networks differ in scaling exponents, and certain online behaviors diverge from real-life social patterns.