Selected research activities
My research focus is on the intersection of machine learning and education.
Selected areas of my work include:
1. Learning and content analytics: analyzing the knowledge state of every student and the quality of every learning resource, e.g., textbooks, lecture videos, and assessment questions.
2. Grading and feedback: automatically grading student responses to open-response questions and providing feedback.
3. Personalization: recommending fail-safe personalized learning actions for every individual student to maximize their future knowledge retention.
4. Behavior analysis: analyzing how learner behavior affects learning outcomes.
I am also broadly interested in many areas of machine learning and its application to many other domains.
Selected works and publications are listed below; see my publications page for a full list.
Learning and Content Analytics
SPARse factor analysis for learning and content analytics (SPARFA)
SPARFA is a purely data-driven framework for learning and content analytics. Based on the observation that only a small number of latent factors (which we term "concepts") control students' performance, SPARFA analyzes binary-valued (correct/incorrect) graded student responses to assessment questions and jointly estimates i) question-concept associations, ii) student concept knowledge, and iii) question intrinsic difficulties. SPARFA performs learning analytics by providing personalized feedback to students on their knowledge level on each concept, and performs content analytics by analyzing how every question is related to each concept and how difficult it is. The original SPARFA paper can be found here:
An extension to analyze ordinal responses (partial credits) can be found here:
An extension that jointly analyzes graded response data and question text to interpret the meaning of the latent concepts can be found here:
An extension that performs time-varying learning analytics by tracing students' knowledge evolution through time, and that also improves content analytics by analyzing the content and quality of learning resources (e.g., textbooks and lecture videos), can be found here:
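As a rough illustration of the kind of response model described above, the following minimal sketch computes the probability of a correct response from question-concept associations, student concept knowledge, and a question-specific intrinsic term. The probit link, the sign convention of the intrinsic term, the function names, and all numbers are assumptions made for illustration; they are not taken from the SPARFA papers.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, used here as a probit link (an assumption)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def success_probability(question_loadings, student_knowledge, intrinsic_offset):
    """P(correct) for one student-question pair under a sparse factor model.

    question_loadings: non-negative question-concept associations (mostly zeros)
    student_knowledge: the student's knowledge of each concept
    intrinsic_offset:  question-specific term (larger = easier, by assumption)
    """
    z = sum(w * c for w, c in zip(question_loadings, student_knowledge))
    return normal_cdf(z + intrinsic_offset)


# Hypothetical example with three concepts; the question only involves
# concepts 0 and 2, so knowledge of concept 1 does not affect the outcome.
loadings = [1.2, 0.0, 0.5]
strong_student = [1.0, -0.5, 0.8]
weak_student = [-1.0, 2.0, -0.8]

p_strong = success_probability(loadings, strong_student, intrinsic_offset=0.0)
p_weak = success_probability(loadings, weak_student, intrinsic_offset=0.0)
print(f"strong student: {p_strong:.3f}, weak student: {p_weak:.3f}")
```

In this toy setup the sparsity of the loadings is what makes the estimates readable: each question is tied to only a few concepts, so a low success probability can be traced back to specific concepts.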
Non-linear student-response models: Dealbreaker and BLAh
Most existing student-response models are linear and additive; they achieve good prediction performance but offer limited interpretability. We develop two non-linear student-response models: the Dealbreaker model, which models a student's chance of answering a question correctly as dependent only on their minimum concept knowledge among the concepts the question covers, and the Boolean logic analysis (BLAh) model, which models binary-valued graded student responses as outputs of Boolean logic functions.
Traditional compensatory student-response models, including SPARFA, characterize a student's success probability when answering a question as dependent on a linear combination of their knowledge on different concepts. Such linear models can be used to predict unobserved responses, but offer limited interpretability since they allow students to make up for their lack of knowledge on certain concepts with high knowledge on other concepts. In contrast, the Dealbreaker model is a non-linear model that characterizes a student's success probability on a question as dependent only on their weakest knowledge among all concepts tested in the question. The Dealbreaker paper can be found here:
The BLAh model goes beyond the "AND" family of models to which the Dealbreaker model belongs, and characterizes the graded response of a student on a question as the output of the Boolean logic function corresponding to the question, making it much more flexible and interpretable than the Dealbreaker model. The BLAh paper can be found here:
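The difference between the three model families discussed above can be sketched in a few lines. This is a toy illustration only: the logistic link, the concept names, and the function names are assumptions, and the actual parameterizations in the Dealbreaker and BLAh papers may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compensatory_probability(knowledge, concepts, difficulty=0.0):
    """Linear, additive model: knowledge on the tested concepts is summed."""
    return sigmoid(sum(knowledge[k] for k in concepts) - difficulty)


def dealbreaker_probability(knowledge, concepts, difficulty=0.0):
    """Dealbreaker-style model: only the weakest tested concept matters."""
    return sigmoid(min(knowledge[k] for k in concepts) - difficulty)


def boolean_response(mastery, logic_function):
    """BLAh-style model: the graded response is a Boolean function of mastery."""
    return bool(logic_function(mastery))


# Hypothetical student: strong in algebra, weak in calculus.
knowledge = {"algebra": 3.0, "calculus": -2.0}
tested = ["algebra", "calculus"]

p_compensatory = compensatory_probability(knowledge, tested)
p_dealbreaker = dealbreaker_probability(knowledge, tested)

# Hypothetical questions with different Boolean logic functions.
mastery = {"algebra": True, "calculus": False}
needs_both = boolean_response(mastery, lambda m: m["algebra"] and m["calculus"])
needs_either = boolean_response(mastery, lambda m: m["algebra"] or m["calculus"])

print(p_compensatory, p_dealbreaker, needs_both, needs_either)
```

In the compensatory model the strong algebra knowledge masks the calculus gap, whereas the Dealbreaker-style model exposes it; the Boolean version additionally allows questions that can be solved through either concept.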
Automatic question generation: QG-Net
The ever-growing amount of educational content makes it increasingly difficult to manually generate sufficient practice or quiz questions to accompany it. We propose QG-Net, a recurrent neural network-based model specifically designed for automatically generating quiz questions from educational content such as textbooks. QG-Net outperforms state-of-the-art neural network-based and rule-based systems for question generation, both when evaluated on standard benchmark datasets and when evaluated by human evaluators. The paper can be found here:
Grading and Feedback
Mathematical language processing (MLP)
MLP is a framework for analyzing students' responses to open-response mathematical questions for grading and feedback. We featurize and cluster students' responses to open-ended mathematical questions, e.g., free-form derivations that are common in science, technology, engineering, and mathematics (STEM) fields. Then, we perform automatic grading and feedback using a small number of instructor-graded responses. The MLP paper can be found here:
We developed a new natural language processing-based framework to detect the common misconceptions among students' textual responses to short-answer questions. Our framework excels at classifying whether a response exhibits one or more misconceptions. More importantly, it can also automatically detect the common misconceptions exhibited across responses from multiple students to multiple questions; this property is especially important at large scale, since instructors will no longer need to manually specify all possible misconceptions that students might exhibit. The paper can be found here:
Personalized learning action selection
We study the problem of turning the insights gained from learning and content analytics into personalization -- providing personalized recommendations for each student on what learning actions (read a section of a textbook, watch a lecture video, work on a practice question, etc.) they should take. We make use of the contextual bandits framework; the papers can be found here:
An extension on taking uncertain context into account can be found here:
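To make the contextual bandits framing concrete, here is a minimal sketch in which each learning action is an arm and the student's estimated knowledge is the context. LinUCB is used purely as a familiar stand-in; it is not necessarily the algorithm used in the papers, and the action names, context encoding, and reward rule below are all hypothetical.

```python
import math


class LinUCBArm:
    """One learning action with a ridge-regression model of its reward."""

    def __init__(self, dim):
        # Inverse of the regularized design matrix, initialized to the identity.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, vec):
        return [sum(row[j] * vec[j] for j in range(len(vec))) for row in self.a_inv]

    def upper_confidence_bound(self, x, alpha):
        theta = self._a_inv_times(self.b)  # ridge estimate of the reward weights
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(0.0, sum(xi * v for xi, v in zip(x, a_inv_x))))
        return mean + alpha * width

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of the inverse design matrix.
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


def select_action(arms, x, alpha):
    """Index of the arm with the highest upper confidence bound for context x."""
    scores = [arm.upper_confidence_bound(x, alpha) for arm in arms]
    return max(range(len(arms)), key=lambda i: scores[i])


# Hypothetical learning actions and a toy interaction loop.
actions = ["read textbook section", "work on practice question"]
arms = [LinUCBArm(dim=2) for _ in actions]
for t in range(200):
    knowledge = float(t % 2)  # toy context: 0 = novice, 1 = proficient
    x = [1.0, knowledge]      # bias term plus estimated knowledge
    chosen = select_action(arms, x, alpha=1.0)
    # Toy reward rule (an assumption): reading helps novices, practice helps
    # proficient students.
    reward = 1.0 if chosen == int(knowledge) else 0.0
    arms[chosen].update(x, reward)
```

The exploration weight `alpha` controls how strongly uncertain actions are favored; in a real deployment the context would come from the learning analytics estimates rather than a toy knowledge flag.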
We demonstrate that linearizing the probit model in combination with linear estimators performs on par with state-of-the-art nonlinear regression methods, such as posterior mean or maximum a-posteriori estimation. More importantly, we derive exact, closed-form, and nonasymptotic expressions for the mean-squared error of our linearized estimators. Applying our linearization technique to IRT models (the Rasch model, in particular) yields much tighter bounds on learner and question parameter estimates, especially when the numbers of learners and questions are small. Therefore, our analysis has the potential to improve the safety of personalization. The papers can be found here:
Measuring engagement from clickstream data
We propose a new model for learning that relates video-watching behavior to engagement level. One of the advantages of our method for determining engagement is that it can be done entirely within standard online learning platforms, serving as a more universal and less invasive alternative to existing measures of engagement that require the use of external devices. We also find that our model identifies key behavioral features (e.g., larger numbers of pauses and rewinds, and smaller numbers of fast forwards) that are correlated with higher learner engagement. The paper can be found here:
Instructor preference analysis
We propose a latent factor model that analyzes instructors' preferences in explicitly excluding particular questions from learners' assignments in a particular subject domain. We incorporate expert-labeled Bloom's Taxonomy tags on each question as a factor in our statistical model to improve model interpretability. Our model provides meaningful interpretations that help us understand why instructors exclude certain questions, thus helping automated learning systems to behave more "instructor-like". The paper can be found here:
Prerequisite structure extraction from user clickstreams
Existing approaches to automatically inferring prerequisite dependencies rely on analysis of either content (e.g., topic modeling of text) or performance (e.g., quiz results tied to content) data, and are therefore not feasible in cases where courses have no assessments or only short content pieces (e.g., short video segments). We propose an algorithm that extracts prerequisite information using learner behavioral data instead, and apply it to an online short course. Our algorithm excels at both predicting learner behavior and revealing fine-grained insights into prerequisite dependencies between content segments, with validation provided by a course administrator. The paper can be found here:
Personalized thread recommendation in MOOCs
We propose a probabilistic model, based on point processes, for the process of learners posting on MOOC discussion forums. Unlike existing work, our method integrates topic modeling of the post text, timescale modeling of the decay in post activity over time, and learner topic interest modeling into a single model, and infers this information from user data. Our method also varies the excitation levels induced by posts according to the thread structure, to reflect typical notification settings in discussion forums. We experimentally validate the proposed model on three real-world MOOC datasets, with the largest one containing up to 6,000 learners making 40,000 posts in 5,000 threads. Results show that our model excels at thread recommendation, achieving significant improvement over a number of baselines, and thus shows promise for directing learners more efficiently to threads that they are interested in. The paper can be found here:
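The ideas of decaying post activity and post-dependent excitation can be illustrated with a self-exciting point process. The sketch below assumes a Hawkes-style intensity with an exponential kernel, which is one common choice and not necessarily the exact form used in the paper; the function name, parameter names, and numbers are hypothetical.

```python
import math


def posting_intensity(t, past_posts, base_rate, decay):
    """Rate of new posts at time t for a self-exciting point process.

    past_posts: list of (post_time, excitation) pairs; the excitation of each
                post can vary, e.g., with its position in the thread.
    base_rate:  background posting rate when no recent activity exists.
    decay:      how quickly the influence of a past post fades.
    """
    triggered = sum(
        excitation * math.exp(-decay * (t - post_time))
        for post_time, excitation in past_posts
        if post_time < t
    )
    return base_rate + triggered


# Hypothetical thread: the first post excites follow-ups more strongly than
# the second one.
past_posts = [(1.0, 0.8), (2.0, 0.2)]
soon_after = posting_intensity(2.5, past_posts, base_rate=0.1, decay=1.0)
much_later = posting_intensity(20.0, past_posts, base_rate=0.1, decay=1.0)
print(soon_after, much_later)
```

Shortly after the posts the intensity is elevated, and it falls back toward the base rate as their influence decays; ranking threads by a learner-specific intensity of this kind is one way such a model could drive recommendations.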
Learning robust binary hash functions
We propose a new data-dependent method to learn binary hash functions. Inspired by recent progress in robust optimization, we develop a novel hashing algorithm, dubbed RHash, that minimizes the worst-case distortion among pairs of points in a dataset. Using several large-scale real-world image datasets, we show that RHash achieves the same retrieval performance as state-of-the-art algorithms in terms of average precision while using up to 60% fewer bits. The paper can be found here:
Sensor selection for biosensing and structural health monitoring
We develop a new sensor selection framework for sparse signals that finds a small subset of sensors (less than the signal dimension) that best recovers such signals. Our proposed algorithm, Insense, minimizes a coherence-based cost function that is adapted from classical results in sparse recovery theory. Using a range of datasets, including two real-world datasets from microbial diagnostics and structural health monitoring, we demonstrate that Insense significantly outperforms conventional algorithms when the signal is sparse. The paper can be found here:
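A coherence-based selection criterion can be illustrated on a tiny example. The sketch below scores a subset of sensors (rows of a sensing matrix) by the mutual coherence of the resulting submatrix and picks the best subset by exhaustive search. Both choices are assumptions made for illustration: the cost function and the optimization procedure used by Insense may differ, and the matrix is made up.

```python
import itertools
import math


def mutual_coherence(rows):
    """Largest absolute normalized inner product between two distinct columns."""
    columns = list(zip(*rows))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    if any(n == 0.0 for n in norms):
        return float("inf")  # some signal entry is not observed by any sensor
    worst = 0.0
    for i, j in itertools.combinations(range(len(columns)), 2):
        inner = sum(a * b for a, b in zip(columns[i], columns[j]))
        worst = max(worst, abs(inner) / (norms[i] * norms[j]))
    return worst


def select_sensors(sensing_matrix, num_sensors):
    """Exhaustive search for the row subset with the lowest mutual coherence."""
    candidates = itertools.combinations(range(len(sensing_matrix)), num_sensors)
    return min(
        candidates,
        key=lambda idx: mutual_coherence([sensing_matrix[i] for i in idx]),
    )


# Hypothetical setup: 4 candidate sensors, signal dimension 3, pick 2 sensors.
sensing_matrix = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
best = select_sensors(sensing_matrix, num_sensors=2)
best_coherence = mutual_coherence([sensing_matrix[i] for i in best])
print(best, best_coherence)
```

Exhaustive search is only viable for toy sizes like this one; it is used here to keep the criterion itself visible. Sensor 3 responds identically to every signal entry, so any subset containing it makes at least two columns indistinguishable and is penalized by the coherence score.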
Cloud dynamics and bidding strategy
We propose a nonlinear dynamical system model for the time-evolution of the spot price as a function of latent states that characterize user demand in the spot and on-demand markets. This model enables us to adaptively predict future spot prices given past spot price observations, allowing us to derive user bidding strategies for heterogeneous cloud resources that minimize the cost to complete a job with negligible probability of interruption. The paper can be found here:
We show that, given an initial guess, phase retrieval can be carried out with an even simpler, linear procedure. Our algorithm, called PhaseLin, is the linear estimator that minimizes the mean squared error (MSE) when applied to the magnitude measurements. We demonstrate that by iteratively using PhaseLin, one arrives at an efficient phase retrieval algorithm that performs on par with existing convex and nonconvex methods on synthetic and real-world data. The paper can be found here:
A method relying on a novel linear spectral estimator (LSPE) to obtain accurate initialization for phase retrieval: