Statistical data mining for Sina Weibo, a Chinese micro-blog: sentiment modelling and randomness reduction for topic modelling

Cheng, Wenqian (2017) Statistical data mining for Sina Weibo, a Chinese micro-blog: sentiment modelling and randomness reduction for topic modelling. PhD thesis, London School of Economics and Political Science.

Preview

Text - Submitted Version
Download (13MB) | Preview

Abstract

Before the arrival of modern information and communication technology, it was not easy to capture people’s thoughts and sentiments; however, the development of statistical data mining techniques and the prevalence of mass social media provide opportunities to capture those trends. Among all types of social media, micro-blogs make use of the word limit of 140 characters to force users to get straight to thepoint, thus making the posts brief but content-rich resources for investigation. The data mining object of this thesis is Weibo, the most popular Chinese micro-blog. In the first part of the thesis, we attempt to perform various exploratory data mining on Weibo. After the literature review of micro-blogs, the initial steps of data collection and data pre-processing are introduced. This is followed by analysis of the time of the posts, analysis between intensity of the post and share price, term frequency and cluster analysis. Secondly, we conduct time series modelling on the sentiment of Weibo posts. Considering the properties of Weibo sentiment, we mainly adopt the framework of ARMA mean with GARCH type conditional variance to fit the patterns. Other distinct models are also considered for negative sentiment for its complexity. Model selection and validation are introduced to verify the fitted models. Thirdly, Latent Dirichlet Allocation (LDA) is explained in depth as a way to discover topics from large sets of textual data. The major contribution is creating a Randomness Reduction Algorithm applied to post-process the output of topic models, filtering out the insignificant topics and utilising topic distributions to find out the most persistent topics. At the end of this chapter, evidence of the effectiveness of the Randomness Reduction is presented from empirical studies. The topic classification and evolution is also unveiled.

Item Type:	Thesis (PhD)
Additional Information:	© 2017 Wenqian Cheng
Library of Congress subject classification:	H Social Sciences > HA Statistics
Sets:	Departments > Statistics
URI:	http://etheses.lse.ac.uk/id/eprint/3488

Actions (login required)

Record administration - authorised staff only

Download statistics

Downloads

Downloads per month over past year

View more statistics