Applied Data Science (MS) Student Capstone Projects

Case Analysis Capstone (ADS670) aims to develop both technical and soft skills that are not directly taught in the traditional courses in the program, but are relevant and critical in order to develop, innovate and communicate in modern data science. This is a project-oriented capstone that will harness the skills gained throughout the program.

Below are some examples of original research studies done by students in our master's in Applied Data Science program for their completed capstone projects.

Class of 2024

Development and evaluation of curve fitting models to reduce calibration time for GEM blood-gas analyzer**

By: Celine Breton G'24

The GEM Premier 5000 is a point-of-care blood-gas analyzer that uses sensors to report concentrations of analytes in patient blood samples. It provides accurate results through frequent calibration using process control solutions (PCS), which are exposed to the sensors for 55 seconds. During this PCS exposure, mV readings from the sensors are recorded, and the mV reading from the end of the soak profile (at t=55 seconds) is used for calibration. The purpose of this study was to develop and evaluate curve fitting models to reduce the amount of time PCS solutions need to be exposed to the sensors, which would increase instrument availability and allow users to process higher sample volumes. This study focused on developing methods for PCS A, which is the main calibration solution, and the pCO2 sensor. A total of 20,000 PCS A soak profiles were used to develop curve fitting models to predict the end soak profile mV using varying time segments from 1-37 seconds. Three model types were evaluated: linear, parabolic, and constrained parabolic. These models were evaluated via two main metrics: the RMSE of the predicted end soak profile mV versus the actual, and the amount of error that using the predicted soak profile mV would introduce into a hypothetical patient blood sample. The constrained parabolic model with a time frame of 30-37 seconds and a constrained vertex of 80 performed best, with an RMSE of 0.66 and with 99% of the hypothetical samples calculated using the predicted mV falling within total allowable error.
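
As a rough illustration of the constrained parabolic approach described above, the sketch below fits mV = a(t - v)^2 + c to readings from the 30-37 second window with the vertex time v held fixed, then extrapolates to t = 55 seconds. It assumes that the "constrained vertex of 80" is a vertex time in seconds, which the abstract does not state explicitly. The function names and the synthetic soak profile are hypothetical and are not the study's code or data.

```python
import math

def fit_constrained_parabola(times, mv, vertex_t=80.0):
    """Least-squares fit of mv = a * (t - vertex_t)**2 + c with the vertex time fixed.

    Fixing the vertex reduces the problem to simple linear regression of mv
    on x = (t - vertex_t)**2, which has a closed-form solution.
    """
    xs = [(t - vertex_t) ** 2 for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(mv) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, mv))
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c

def predict_mv(a, c, t, vertex_t=80.0):
    """Evaluate the fitted parabola at time t."""
    return a * (t - vertex_t) ** 2 + c

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(predicted, actual)) / len(actual))

# Synthetic, noise-free soak profile (illustrative values only, not study data).
TRUE_A, TRUE_C = -0.01, 120.0
window = [float(t) for t in range(30, 38)]  # t = 30..37 seconds
readings = [TRUE_A * (t - 80.0) ** 2 + TRUE_C for t in window]

a, c = fit_constrained_parabola(window, readings)
predicted_end = predict_mv(a, c, 55.0)
actual_end = TRUE_A * (55.0 - 80.0) ** 2 + TRUE_C
print(f"predicted end-of-soak mV: {predicted_end:.3f}")
print(f"RMSE vs. actual: {rmse([predicted_end], [actual_end]):.6f}")
```

With real profiles, the RMSE would be computed across many soak profiles rather than the single synthetic one used here.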

** Video not available due to the proprietary nature of the topic.

Healthcare field advancement: The implementation of machine learning in diagnosing age-related health conditions

By: Kyle Lacson G'24

Medical diagnosis is an integral process for medical professionals within the healthcare community, as it is essential for proper treatment. By leveraging data science principles, researchers and healthcare professionals can use machine learning and deep learning methods to draw accurate medical conclusions. Patient health and medical information was collected and anonymized by InVitro Cell Research for data enthusiasts to explore different classification methods and algorithms. The data was pre-processed and visualized to better understand its composition. A handful of algorithms, including Gaussian classifiers, tree classifiers, and deep learning methods, were trained and evaluated using multiple variations of the dataset. The XGBoost classifier proved to be the best algorithm for this task, correctly classifying 95.96% of individuals as having or not having a medical complication. It was concluded that machine learning and deep learning principles will help develop and advance the healthcare field.

Video Link: https://youtu.be/qWILmad83-Q

Curious Confounders: A Gestational Age Gene Expression Meta Analysis

By: Esther Malka Laub G'24

Congenital heart disease (CHD) is heart disease that is present at birth. Although congenital heart defects are the most common birth defect in the United States, very little is known about their causes. Using three publicly available NIH studies, this project explores gene expression during each of the three trimesters of pregnancy as a baseline study to try to pinpoint the genes that may cause CHD.

The data used is a combined cohort from the three studies that includes only healthy infants, to serve as a baseline for future application to a CHD cohort. The data is further split into two datasheets: count data and metadata. The count data is a compilation of the counts of 17,005 different placental genes from each of the 125 placental samples from the three combined studies. The metadata, in contrast, covers the variables explored in each of the three studies, including but not limited to gestational age, infant sex, and whether the delivery was preterm. However, when the analysis reveals a confounding variable, a new analysis plan needs to be developed to produce accurate results.

Video Link: https://youtu.be/Qajb2aUi-eQ

Predictive Analysis of Violent Crime Rates in the US

By: Christina Myers G'24

This study employs machine learning techniques to predict violent crime rates in US states from 2011 to 2019. The primary objective is to identify the factors contributing to these crime rates to optimize resource allocation and prevention strategies. Among the models employed, XGBoost is the most accurate. The research highlights the importance of education in crime prevention, with education levels ranking as the most influential variable. This study also reveals that the District of Columbia has the highest predicted crime rate while Vermont has the lowest. Identifying states with the lowest crime rates allows for them to be used as potential models to lower crime rates. Ultimately, this research serves as a valuable resource for advocating and directing investments into education, paving the way for a safer and more secure future for our communities.

Video Link: https://www.youtube.com/watch?v=Vfykv9DJf

Absolutely Accurate

By: Shawn Smith G'24

Inaccurate case scheduling can cause major disruption to the operating room and result in significant financial loss. Throughout this project, I analyzed numerous cases and scheduling guidelines to develop solutions that would improve the hospital's surgical case scheduling accuracy. The first problem identified was with the measures used to determine the recent average for a surgeon's procedure. These measures were severely flawed and required alternative sample sizes to reflect a more accurate recent average. The second problem identified was that no guidelines had been established for the scheduling process, which is especially problematic because 90% of cases are scheduled through the contact center.

The contact center is off-campus and communication with the clinical staff in the operating room is limited, so it is extremely important to develop scheduling guidelines to ensure that surgical cases get scheduled accurately. Future work will include restructuring the metrics for the recent average and establishing scheduling guidelines to increase our surgical case scheduling accuracy percentage. Improving this percentage will have a significant impact on operating room prime time utilization and will reduce same-day case cancellations.

Video Link: https://youtu.be/NkzsyzxTPww

Using Machine Learning to Predict the NBA MVP

By: Zachary Williams G'24

In this project, we took raw data from two different sources, FiveThirtyEight and Basketball Reference, and created a regression-based model to predict who would win the 2022-2023 MVP award, determine which statistics were best at predicting the MVP, and compare the outcome to sportsbook betting odds. I chose this project because of my love for the sport, the connection to my industry of online sportsbook betting, and the intrigue of seeing how well the data could predict the NBA MVP. The dataset was run against SVR, Random Forest Regressor, Gradient Boosting Regressor, and KNN Regressor models, with R-squared and mean squared error used to measure effectiveness. Overall, the results were very positive: the estimates were similar to the actual MVP outcome for the 2022-2023 NBA season, and the project produced a rubric for applying the code and data to the 2023-2024 season. In the future, the plan is to clean up the project code, run the model against the new season's data, start working on other sports or awards, and improve results for the regressors.
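
For readers unfamiliar with the two evaluation metrics named above, here is a minimal sketch of how mean squared error and R-squared are computed from a regressor's predictions. The target values are invented for illustration (the abstract does not say what quantity the regressors predicted), and the function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical target values for four players (illustrative only).
actual_score = [0.90, 0.60, 0.30, 0.10]
predicted_score = [0.80, 0.65, 0.35, 0.05]

mse = mean_squared_error(actual_score, predicted_score)
r2 = r_squared(actual_score, predicted_score)
print(f"MSE = {mse:.6f}, R-squared = {r2:.4f}")
```

A lower MSE and an R-squared closer to 1 both indicate predictions that track the actual values more closely.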

Video Link: https://www.youtube.com/watch?v=ysaQnjIeNGA

Predictive Modeling for Stroke Detection: A Comprehensive Healthcare Data Analysis

By: Brianna Tittarelli G'24

This project focuses on predicting strokes in healthcare data through a comprehensive data-driven approach. Beginning with data exploration and preprocessing, the study addressed missing values and encoded categorical variables using one-hot encoding. The dataset was split into features and target variables, followed by further division into training and testing sets. To enhance model performance, standardization and normalization techniques were applied to the features. To tackle class imbalance, Random Over Sampling was employed. Exploratory data analysis techniques, including histograms, scatter plots, and box plots, were utilized to gain insights into the relationships between variables.
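
Random over-sampling, mentioned above, can be sketched in a few lines: minority-class rows are duplicated at random until every class is as frequent as the majority class. The toy rows, labels, and function name below are hypothetical. Applying the resampling to the training split only is a common practice that is assumed here, since the abstract does not say which split was resampled.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Candidate rows to duplicate for this class, drawn from the original data only.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy training split: [age, hypertension flag]; label 1 = stroke (illustrative values).
train_rows = [[67, 1], [45, 0], [52, 0], [38, 0], [71, 1], [29, 0], [60, 0], [33, 0]]
train_labels = [1, 0, 0, 0, 1, 0, 0, 0]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(train_labels), "->", Counter(balanced_labels))
```

Because the added rows are exact copies, this method changes class frequencies without adding any new information to the training data.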

The study employed two machine learning models: Logistic Regression and Random Forest. The Logistic Regression model was trained and evaluated on validation and test sets, showcasing promising results. Subsequently, a Random Forest model was employed, further addressing class imbalance with Random Over-Sampling. Hyperparameter tuning using Grid Search improved the Random Forest model's performance. The final model was selected based on the best hyperparameters and demonstrated robust predictive capabilities. Synthetic Minority Over-sampling Technique (SMOTE) was implemented to handle class imbalance, enhancing model performance. The project provides a comprehensive framework for predictive modeling in healthcare, emphasizing the significance of data preprocessing, feature engineering, and model selection. The Random Forest model, after hyperparameter tuning, emerged as the most effective predictor of strokes. Overall, this study presents a structured approach to predictive modeling, demonstrating its applicability in healthcare data analysis.
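
To show how SMOTE differs from plain duplication, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. This is a simplified illustration of the core idea only, not the study's code or a library implementation; the function name, the choice of k, and the data points are hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on line segments between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority-class feature vectors (illustrative values only).
minority_rows = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic_rows = smote_sketch(minority_rows, n_new=4)
for row in synthetic_rows:
    print([round(value, 3) for value in row])
```

Each synthetic point lies between two real minority points, so the minority class gains new, plausible examples instead of exact copies.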

Video Link: https://youtu.be/o--aJ1oknF

Credit Card Fraud Detection

By: Durga Rao Rayapudi G'24

Credit cards play a major role in our day-to-day lives, as most of us use them all the time, in person or online. Credit card usage has increased drastically with the emergence of the internet and e-commerce. Credit cards fall into the wrong hands either physically or through online scams. Email phishing and data breaches are the two main mechanisms intruders use to steal card information online. Credit card fraud is a type of identity theft in which an unauthorized user makes a transaction without the cardholder's knowledge or approval. Credit card fraud is considered one of the biggest crimes globally, so financial institutions have been trying their best to control it, as it causes them severe losses. These losses are not limited to banks; individuals are affected as well. I was personally a fraud victim in 2016, and around 65% of people have been victims of this fraud at least once. According to the Nilson Report, this fraud caused $28.58 billion in losses to financial institutions globally in 2020.

This research aims to identify credit card fraud at an early stage by developing a machine learning model that can predict whether a given transaction is fraudulent. Because the model follows the data science paradigm, its development happened in multiple phases. In each phase, the data was studied thoroughly, since understanding it well in the early stages helps avoid rework. I started with the data collection phase, finding an appropriate data source on Kaggle, a website that hosts datasets. The data was explored using several statistical, analytical, and data visualization techniques. In the pre-processing phase, the original data was updated or removed using several data modification techniques. Once the data was ready, it was divided into training and test datasets using different algorithms. The model was trained, and baseline accuracy scores were calculated, using binary classification algorithms such as Random Forest, K-nearest neighbors, Decision Tree, and Logistic Regression. The model was then evaluated on the test datasets and its accuracy compared against the training baseline scores. The logistic regression model performed best, with an accuracy score of 98.2%.

Video Link: https://youtu.be/KTn0Kc-fcB8

Class of 2023

Predicting Store Weekly Sales: A Case Study of Walmart Historical Sales Data

Stephen Boadi G'23

This capstone project aimed to develop a predictive model for weekly sales of different Walmart stores. The dataset was provided by Walmart as part of a Kaggle competition and contained various features including the store size, location, type, and economic indicators. The project used different regression models, including linear regression, random forest regression, and gradient boosting regression, to predict weekly sales.

After data exploration, preprocessing, and feature engineering, the models were trained and evaluated using the training and validation data. Hyperparameter tuning and feature importance analysis were used to improve the performance of the models. The final best model was selected based on its validation error and compared to the top score in the Kaggle competition leaderboard.

The results showed that the Random Forest Regressor model had the lowest validation error and was chosen as the best model, as it showed strong predictive power for the problem. The key features that influenced weekly sales were store size, store type, and department within the store. The model was used to predict the weekly sales for the test data, and its results were evaluated and compared to the Kaggle competition.

Overall, the project demonstrated the use of data cleaning, regression models, hyperparameter tuning, and feature importance analysis to develop a predictive model for weekly sales. The final best model showed promising results, but further improvements could be made with more data and feature engineering.

View the Presentation on YouTube

Predicting Burnout: A Workplace Calculator

Jill Anderson G'23

Is it possible to predict and ultimately prevent burnout? When the pandemic began, employers moved their employees to work from home, where possible. More than two years later, many of these employees have not returned to the office. However, some employees, including myself, prefer to work in an office environment. I hypothesize this may be associated with burnout. HackerEarth hosted a competition that ran from October 20 to November 19, 2020, and its dataset was also used to create a burnout quiz. I took this quiz several times to see how I scored. One of my attempts resulted in a lower score when I increased one of the attributes, how busy I consider myself. Could I create a better model than the survey? Results show that feature selection and regression modeling are effective for predicting burnout. A predictive model of this type could guide employees and employers and minimize burnout. For example, when an employee approaches a score close to burnout level, they and their manager can have a discussion. This discussion could result in changes that lower the employee’s score before they burn out.

View the Presentation on YouTube

Beyond Artist: Text-to-Image AI

Sarah Snyder G'23

Text-to-image AI is new software that turns a text prompt into stunning images. The prompt can be long or short, detailed or simple, with output that can be in any style or medium the user could imagine. Painting, photography, sculpture, architecture, industrial design, fashion design and more: this new software can do it all with stunning realism that is oftentimes indistinguishable from the real thing. The images are so convincing that even experts in their respective fields have been fooled when faced with distinguishing real work from AI-created work.

Below is a series of 6 images: one of the six is real and selling for $36,000 at Sotheby’s Auction House; the rest are AI-generated. Can you spot the real painting?

With most of these programs now widely accessible to the public, the art world has been disrupted like never before. What it means to be an artist has been left in free fall as the world decides what the definition of art is. Ethical outrage has erupted over the discovery that these companies are using datasets of images scraped from the internet containing the intellectual property from countless people without their consent or knowledge. Many artists are facing the realistic future of being replaced by AI, while others embrace this new technology.

Follow me as I present the magnitude of these advances, how the software works, uses, applications, controversy, ethics, historical context and more through a captivating one-hour presentation that takes us beyond the concept of the artist and into the uncharted territory of text-to-image AI.

View the Presentation on YouTube

Predicting the Assessment Value of a Home

Corey Clark G'23

Countless tools have been developed for predicting the sale price of a house; however, predicting the assessment value is a topic yet to be explored thoroughly. Our analysis focused on determining the most important factors for predicting the assessed value of a home, then comparing them with the factors predicting the sale price. Our dataset consisted of parcel information from the municipal office of Warwick, RI, which was exported from the Vision Government Solutions software using an interface called the Warren Group Extract. We limited our selection to residential properties with a sale price of over $10,000. Using Lasso and Random Forest regression, we weighted the importance of each feature for predicting both the sale price and the assessed value. The Lasso models were more inconsistent than the Random Forest models. On average, the predicted total assessment using Random Forest was $3,575 less than the actual value, while the average predicted sale price using Random Forest was $4,250 less than the actual value. Both predictions are within an acceptable range. For both the sale price and total assessed value Random Forest models, the effective area of the house was the most critical factor. However, as expected, comparing the models revealed some differences in variable importance. In the assessed value prediction, the Random Forest model identified grade as more important than total acreage, whereas in the sale price prediction, total acreage had greater importance than grade. Leveraging this model gives municipalities an alternative way to identify the prioritized features stored in their database and helps determine the correct assessed value of their properties with more confidence.

View the Presentation on YouTube

Best In Class Regression: An Analysis of Car Prices

Daniel Duffee G'23

Across the world, cars are used as a means of transportation, and the demand for them is huge, with more than 60 automotive manufacturers globally. With so many different models and features available, it raises the question, “What is most important in determining the price of a car?” This project set out to construct a model through regression analysis to answer this question. The study used a dataset from Kaggle.com that evaluated car prices in the American marketplace across a variety of brands. It consists of 205 entries and contains 25 features covering the car’s size, performance, engine and more. Multivariate linear regression with recursive feature elimination and random forests were used to craft an effective pricing model. Recursive feature elimination was used to reduce the original dataset to a simpler model, whose adjusted R-squared value was then tested to see how the model performed. Random forests were then used to tackle issues of multicollinearity, with RMSE used to evaluate the new model’s performance. It was found that the most important factors in determining the price of a car were “carwidth”, “curbweight”, “enginesize”, “horsepower”, and “highwaympg.”
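
Adjusted R-squared is a natural check after feature elimination because it penalizes a model for each predictor it keeps. The sketch below applies the standard formula to the dataset size given above (205 entries). The R-squared value and the two predictor counts are hypothetical, chosen only to show the effect of the penalty.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

N_SAMPLES = 205        # dataset size stated in the abstract
ASSUMED_R2 = 0.90      # hypothetical fit quality, identical for both models

full_model = adjusted_r_squared(ASSUMED_R2, N_SAMPLES, n_predictors=24)
reduced_model = adjusted_r_squared(ASSUMED_R2, N_SAMPLES, n_predictors=5)
print(f"24 predictors: {full_model:.4f}")
print(f" 5 predictors: {reduced_model:.4f}")
```

At the same raw R-squared, the model with fewer predictors scores higher, which is why the metric suits comparing models before and after recursive feature elimination.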

View the Presentation on YouTube

The Customer is Always Right: Leveraging Social Media for Customer Feedback

James Kleine-Kracht G'23

The saying goes that "Sales are the voice of the customer," so why not try to leverage their actual voice? In today's world customers are constantly discussing products and services through social media. This project aims to use tweets with specific keywords and apply text mining software to discover real customer feedback. This project focuses on Disney Parks, Experiences, and Products to compare sentiment across multiple avenues. Looking at nearly 700,000 tweets across 27 days, we can compare topics such as Marvel vs. Star Wars and Disney World vs. Disneyland, as well as look at specific topics like Disney Plus or Disney Vacation Club.

View the Presentation on YouTube

Autonomous Maze Solving with a ‘Create2’ Robot

Kyle Kracht G'23

Machine vision can be used to guide a robot through a maze. For the robot to successfully navigate the maze, it must know what turns are available to it at any given time, how to activate its motors to move through the maze, and a strategy for what choices to make. To interpret the visual data, the robot must process it through a convolutional neural network in real time. That necessitates a network architecture that uses very little RAM and computational power. Additionally, to successfully navigate the maze, the network must have very high accuracy, as it will have to make the correct classification many times in a row. The robot must be given explicit directions to execute its maneuvers in the physical space. This is accomplished by writing Python code that sends a velocity value to each motor, a command to wait for a certain period, and then a command to stop. To turn a certain number of degrees, the encoders must be polled, and the motors stopped once a certain difference in encoder counts is read. Finally, the robot requires an algorithmic approach to its decision making. Thankfully, there exists an algorithm for solving any geometrically “simply connected” maze, i.e. one in which the internal walls all connect back to the exterior.
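
The abstract does not name the maze-solving algorithm. The classic method that works for any simply connected maze is the wall follower (hand-on-wall) rule, so the sketch below assumes that rule and simulates it on a small text grid. On the real robot, the check for open passages would come from the camera and neural network; here it is a hypothetical lookup into the grid, and the maze layout and function name are invented for illustration.

```python
# Headings listed clockwise so that +1 means "turn right": N, E, S, W.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

MAZE = [
    "#######",
    "#S#   #",
    "# # # #",
    "#   #E#",
    "#######",
]

def solve_right_hand(maze, start, goal, heading=1, max_steps=10_000):
    """Follow the right-hand wall from start until goal is reached; return the visited cells."""
    rows, cols = len(maze), len(maze[0])

    def is_open(r, c):
        return 0 <= r < rows and 0 <= c < cols and maze[r][c] != "#"

    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            return path
        # Preference order: turn right, go straight, turn left, turn around.
        for turn in (1, 0, -1, 2):
            d = (heading + turn) % 4
            nxt = (pos[0] + DIRS[d][0], pos[1] + DIRS[d][1])
            if is_open(*nxt):
                heading, pos = d, nxt
                path.append(pos)
                break
        else:
            return None  # completely enclosed cell, no move possible
    return None

path = solve_right_hand(MAZE, start=(1, 1), goal=(3, 5))
print(path)
```

The rule needs no map or memory beyond the current heading, which suits a robot that can only classify the turns available at its current position.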

View the Presentation on YouTube

Text-mining a decade of conflicts in Africa

Frankline Owino G'23

Violent conflicts have beset Africa for decades, even though most countries have achieved independence. Between 1997 and 2017, there were 96,717 violent conflicts, and these are only the recorded ones. The violent conflicts in Africa are primarily of three kinds: violence against civilians, riots and protests, and military battles. Between 2012 and 2017 there was a steady increase in violent conflicts. In fact, the rate of increase is likely higher than reported, since the areas where many of these conflicts took place are inaccessible.

This project examines unstructured text data from ACLED and Twitter. The data is used to examine associations between violence and factors such as economic prosperity. Sentiment analysis and supervised learning methods are used to probe issues around the violence and assess motivations that are political or resource-based. Sentiment analysis showed that there was a significantly higher number of negative sentiments over the years. The words "killed", "armed", and "police" are among those that featured prominently, while the word "peace" was mentioned only 93 times in 23 years. Predictive models, such as Support Vector Machines, Random Forests, Boosting, and kNN, were built to predict fatalities and showed at most a 54% accuracy level. A simple binary representation of fatalities outperformed all other models considered, and although performance was not outstanding, it was found to be better than random.

View the Presentation on YouTube

Class of 2022

Robotic Process Automation

Ivette Lugo, G'22

Robotic Process Automation (RPA) refers to the automation of repetitive, tedious, high-volume human tasks. RPA does not involve any form of physical robot; instead, software robots mimic human behavior by interacting with desktop applications and online systems in the same way that a person does. This automation allows employees to concentrate on higher-value tasks and brings higher job satisfaction. Many companies offer automation software packages, but UiPath has proven to be the market leader in the RPA world. UiPath can accommodate small companies as well as large corporations. UiPath also offers a free Community Edition and free courses (https://academy.uipath.com) for those interested in learning and becoming certified as an RPA Developer. Of course, like anything, RPA has many benefits and a few drawbacks, but the benefits far surpass any shortcomings. We will examine three different bots created specifically for this project. The robots perform different human tasks, including browsing the internet, scraping weather and real estate data, and creating and sending emails. These will provide a sense of the many things that this technology can accomplish in the workplace.

Video: https://youtu.be/Ori6gVMwmVY

Determining Predictors in Student Graduation Rates in U.S. Post-Secondary Education

Anastasia Tabasco-Flores, G'22

This analysis of graduation rates in relation to the percent of financial need met at public and private bachelor’s programs uses predictive models to determine potential net cost and likelihood of graduation at post-secondary programs. The study considers the six-year graduation rate at bachelor’s programs in the United States, with common features such as gender and race/ethnicity and added specialized considerations such as the percent of financial need met, obtained through the National Center for Education Statistics and the College Board, respectively. The purpose of this study is to use predictive modeling to determine the feature importance of these variables and to further understand the factors that contribute to successful graduation within six years of attending a bachelor’s program. Methods such as regression trees, random forests, and partial dependence plots (PDPs) were used to build these models and develop clear plots for model interpretation. Random forests and PDPs were used to analyze the data in a clean and concise manner, and to further understand the correlations and dependencies among the independent variables in order to find a best-fit model for the years 2011-2020. With this analysis, colleges would be able to better understand the implications of the percent of financial need met and its connection to graduation rates, informing policies that support a diverse student population as well as institutional policies around access and persistence. Furthermore, students and families will be able to use this information to make informed decisions about which post-secondary program would be the best fit for them, and hopefully it will give students and families tools to access professional education while creating generational wealth.

Video: https://youtu.be/tPvUKRgf_9s

Intimate Partner Violence During the COVID-19 Pandemic in Canada

Sanzhar Baiseitov, G'22

In 2020, the COVID-19 pandemic brought economic and psychological stress to families. These factors are known to be associated with the incidence of intimate partner violence. Intimate partner violence has two subclasses: domestic violence and sexual violence. Sexual violence is often treated as a subset of domestic violence. During the COVID-19 pandemic, an increase in domestic violence police reports was registered. However, not all cases are reported to police, and victims also engage in informal help seeking, such as internet searches. In this study, domestic and sexual violence were treated as separate types. For each violence type, two questions were asked: “Did informal help seeking change during the COVID-19 pandemic?” and “What are important factors in predicting help-seeking behavior?” Google Trends data from Canada for the years 2017-2020 was used for modeling. Univariate and multivariate time series models, as well as the difference-in-differences method, were used to answer the study's questions. An exponential smoothing univariate time series model was used to create a forecast and compare it with the actual data. The change in the help-seeking trend before and after the onset of the pandemic was analyzed using the difference-in-differences method. Several candidate explanatory variables were included in the multiple linear regression model, and a search for the most accurate model was performed. It was found that the informal domestic violence help-seeking trend did not change in 2020 under COVID-19 pandemic conditions. In contrast, the informal sexual violence help-seeking trend was different in 2020.
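
Two of the methods named above are simple enough to sketch directly: simple exponential smoothing (one member of the exponential smoothing family; the abstract does not say which variant was used) and the basic difference-in-differences estimate. The series values, the smoothing factor, and the choice of comparison series are all hypothetical, since the abstract does not describe how the comparison group was defined.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing the whole series."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def mean(values):
    return sum(values) / len(values)

def difference_in_differences(focal_pre, focal_post, comparison_pre, comparison_post):
    """Change in the focal series minus the change in the comparison series."""
    return (mean(focal_post) - mean(focal_pre)) - (mean(comparison_post) - mean(comparison_pre))

# Hypothetical search-interest index values (illustrative only).
pre_pandemic_interest = [40, 44, 42, 46]
forecast = simple_exponential_smoothing(pre_pandemic_interest, alpha=0.5)

did_estimate = difference_in_differences(
    focal_pre=[50, 52], focal_post=[60, 62],
    comparison_pre=[48, 50], comparison_post=[51, 53],
)
print(f"forecast for next period: {forecast}")
print(f"difference-in-differences estimate: {did_estimate}")
```

Comparing the forecast with what was actually observed shows whether the series departed from its earlier pattern, while the difference-in-differences estimate removes change that the comparison series also experienced.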

Dedication:
https://www.linkedin.com/feed/update/urn:li:activity:6907033927777681408/

Presentation Video: https://www.youtube.com/watch?v=LOG_RR4fOu0

Miami Dolphins Sentiment Analysis

By: Chelsea Ripoll, G'22, Generalist Track

This project aims to understand the sentiment of tweets about the Miami Dolphins football team. Twitter data was extracted for 5 consecutive games played between September 19th and October 17th. About 10-15 thousand tweets from each week were extracted using specific keywords related to the Miami Dolphins, resulting in about 50-75 thousand tweets in total. A Twitter developer account was created to connect to the Twitter API using Google Colab to retrieve the data. The extracted data from each week was then stored in a CSV file. The data was then cleaned in Jupyter notebooks using pre-processing steps such as removing unnecessary characters, removing stopwords, applying lemmatization and normalizing the data. Once the data was cleaned, a word cloud was created to show the top 50 words used in each game. In addition, a list of the top 20 most common words and their frequencies was compiled.

Sentiment analysis was performed on the datasets from each week using VADER. SentimentIntensityAnalyzer was imported to calculate the sentiment of each tweet. Sentiment analysis is the process of determining whether a word or phrase is positive, negative or neutral. It is often performed on textual data to help companies monitor brand and product sentiment in customer feedback, understand customer needs and gauge brand reputation. Lately it has become important for companies to analyze what their customers are saying about their brand or product on social media. VADER is a rule-based model that handles social media text well because it can provide sentiment for emoticons, slang words and acronyms. The polarity score function of VADER can accurately label words or phrases as “positive”, “negative” or “neutral”. The model creates a compound score that returns a float for sentiment strength based on the input text; this value is always between -1 and 1. For this project all neutral tweets were ignored. Multiple bins were created based on the compound score distribution: positive, slightly positive, negative and slightly negative. Sentiment analysis across all 5 games yielded a range of 52-58% positive and 42-48% negative tweets. Even though the Miami Dolphins lost 5 consecutive games, most of the tweets were still positive.
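
The binning step described above can be sketched as a small function over compound scores. The neutral band of plus or minus 0.05 follows the convention suggested in VADER's documentation. The 0.5 cutoff separating "slightly" from strongly polar tweets is a hypothetical choice, since the abstract does not give the bin boundaries, and the scores below are invented rather than produced by VADER.

```python
def bin_compound(score, neutral_band=0.05, strong_cutoff=0.5):
    """Map a compound score in [-1, 1] to a sentiment bin, or None for neutral."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("compound score must lie between -1 and 1")
    if abs(score) < neutral_band:
        return None  # neutral tweets are dropped, as in the project
    if score >= strong_cutoff:
        return "positive"
    if score > 0:
        return "slightly positive"
    if score <= -strong_cutoff:
        return "negative"
    return "slightly negative"

# Hypothetical compound scores for six tweets (illustrative only).
scores = [0.82, 0.30, 0.0, -0.12, -0.71, 0.02]
bins = [bin_compound(s) for s in scores]

kept = [b for b in bins if b is not None]
positive_share = sum(1 for b in kept if b.endswith("positive")) / len(kept)
print(bins)
print(f"positive share of non-neutral tweets: {positive_share:.0%}")
```

In practice, the scores would come from the `compound` entry returned by `SentimentIntensityAnalyzer.polarity_scores` for each cleaned tweet.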

Presentation: https://www.youtube.com/watch?v=2GL-hpfKypY

Improving Digital Criteria to Aid in Enrollment & Surveillance for a Process Improvement Program in a Healthcare System

By: Omar Badran, G'22, Specialist Track

Electronic medical records (EMR) contain copious amounts of data that are collected during a patient’s hospital visit. During hospital visits there are processes in place to ensure the patient is receiving adequate care and that it is delivered efficiently. Structured family communication programs are in place to provide support and resources to families of patients that present more severe illness and risk. Criteria used for admission into this program can be viewed as subjective and underperform by either missing patients that should be in the program or using resources on patients that do not belong in the program. The objective of this work is to leverage machine learning approaches in connection with EMR data to automate and improve admission into surveillance programs. Results show that EMR data combined with machine learning techniques drastically improve sensitivity and specificity by 22% and 45%, respectively, when compared to the existing program in place. Moreover, the set of variables for classifying a patient into the program is shown to vary and depend critically on the days of stay. An automated machine learning approach of this type can guide clinical providers to identify patients for structured family communication programs and assist hospital leadership with program surveillance across the healthcare system.
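
For readers unfamiliar with the two metrics, the sketch below shows how sensitivity and specificity are computed from binary labels. The function name and the 0/1 encoding are illustrative assumptions, and the code is not taken from the project.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = belongs in program)."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # correctly enrolled
        elif a == 1 and p == 0:
            fn += 1          # missed a patient who should be enrolled
        elif a == 0 and p == 0:
            tn += 1          # correctly left out
        else:
            fp += 1          # resources spent on a patient who does not belong
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Low sensitivity corresponds to missing patients who should be in the program, and low specificity corresponds to using resources on patients who do not belong in it.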

Presentation: https://www.youtube.com/watch?v=k9i5ySuXllg

A Dive into Reinforcement Learning

By: Evan Reeves, G'22, Specialist Track

Reinforcement learning is one of the three basic types of machine learning, in which agents learn how to perform in an environment via trial and error. This project takes an introductory look at the topic, explores its intricacies, and concludes with the implementation of an inverted double pendulum. Along the way three important algorithms are explored (Q-learning, deep Q-learning, and policy gradients) as well as some real-world use cases for reinforcement learning. The methods are applied to the double pendulum problem to demonstrate the training process.
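
As a rough illustration of the trial-and-error idea, here is a minimal tabular Q-learning sketch. It does not use the project's double pendulum: the environment is a hypothetical five-state corridor with a reward at the right end, and the learning rate, discount factor and exploration rate are assumed values.

```python
import random

N_STATES = 5                         # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)                   # move left or move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # assumed learning rate, discount, exploration


def step(state, action):
    """Apply an action; reaching the last state gives reward 1 and ends the episode."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                a = rng.randrange(2)                       # explore
            else:
                best = max(q[state])                       # exploit, random tie-break
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best next value
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q


q_table = train()
greedy_policy = [ACTIONS[max(range(2), key=lambda i: row[i])] for row in q_table[:-1]]
```

After training, `greedy_policy` holds the preferred move for each non-terminal state. Deep Q-learning replaces the table with a neural network, which is what makes continuous problems such as a pendulum tractable.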

Presentation: https://www.youtube.com/watch?v=7UfnuNF5C9Y

COVID-19 Interventions and Online Learning Engagement in the Northeast

By: Jonathon Royce, G'22, Specialist Track

COVID-19 and the havoc it wreaked on the world in 2020 need no introduction. Countries locked down, businesses closed their doors, and people were mandated to stay at home.

Globally some 219 million cases of COVID were contracted, with about 4.55 million
deaths. The impact on the education of school-aged children was universally
detrimental, though the degree of detriment varied across counties and states.

Understanding how to maximize educational engagement during pandemics can enable preparedness that may improve learning outcomes for future pandemics, or disasters, that would alter the traditional school setting to an online venue. In this presentation, we’ll be examining how the United States’ efforts in 2020 to combat COVID-19 affected engagement in online learning. The focus will be primarily on the northeastern states. Exploratory data analysis and statistical models will be used to identify the critical factors that play a role in estimating online engagement.

Presentation: https://www.youtube.com/watch?v=T7HLsYTMUf4&feature=youtu.be

March Machine Learning Mania

By: Nick Tornatore, G'22, Specialist Track

Dating back to 1939, NCAA Men’s Division I basketball’s annual champion has been
decided through a tournament following the conclusion of the regular season. Since 1985, the annual tournament has included 64 teams, giving rise to the modern brackets so closely associated with the tradition famously called “March Madness”.

Each year an estimated 70 million brackets are submitted to a variety of bracket contests, all with dreams of predicting every step of the tournament correctly. These dreams are promptly crushed, as the probability of ever devising a perfect bracket is so infinitesimally low that Warren Buffett has famously offered a $1 billion prize on occasion for any who could accomplish the feat. To date, none have come close to succeeding.

As part of the cultural phenomenon that is March Madness, Kaggle hosts an annual
competition to create a statistical model that best predicts the outcomes of the annual basketball tournament, but with a slight twist in measuring success compared to the traditional bracket competitions. This project employs a variety of modeling approaches with unique perspectives on evaluating the strength of a team as the tournament begins.
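
The abstract does not name the scoring rule behind the "slight twist". The sketch below assumes it was log loss, which rewards well-calibrated win probabilities rather than correct bracket picks. The function name and clipping constant are illustrative.

```python
import math


def log_loss(outcomes, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = the listed team won)."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1 - eps)   # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(outcomes)
```

Under this metric a confident wrong prediction is penalized far more heavily than a cautious one, so a model is judged on how well it quantifies uncertainty for every possible matchup.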

Model performance is benchmarked and reviewed against historical tournaments and their respective Kaggle leaderboards. Additionally, strong consideration is given to how to better refine the analysis for future tournaments, embodying the spirit of March Madness in striving to eventually be declared champion.

Presentation: https://www.youtube.com/watch?v=UP-ZLILU-50&feature=youtu.be

Introduction to Cluster Stability Using the Clusterboot Function

By: Michael Tucker, G'22, Specialist Track

Clustering data is a challenging problem in unsupervised learning where there is no
gold standard. The selection of a clustering method, measures of dissimilarity,
parameters, and the determination of the number of reliable groupings is often viewed as a subjective process. Stability has become a valuable surrogate for performance and robustness that is useful for guiding an investigator to reliable and reproducible clusters. This project walks through some of the core concepts of stability and demonstrates them in the R programming language with the “clusterboot” function.
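
The core idea behind resampling-based stability can be sketched as follows: recluster resampled data many times and, for each original cluster, record the Jaccard similarity to its best-matching cluster in every resample. This is a simplified Python illustration of that idea, not the R implementation. The function names are hypothetical, and the clusterings are assumed to be already computed and given as sets of point IDs.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of point IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original_cluster, resampled_clusterings):
    """Mean best-match Jaccard similarity of one original cluster across resamples.

    Each element of resampled_clusterings is a list of clusters (sets of point IDs)
    found in one resample. The original cluster is restricted to the points that
    appear in that resample before comparison.
    """
    scores = []
    for clusters in resampled_clusterings:
        present = set().union(*clusters)
        restricted = set(original_cluster) & present
        scores.append(max(jaccard(restricted, c) for c in clusters))
    return sum(scores) / len(scores)
```

Values near 1 suggest a cluster that reappears consistently under resampling, while low values suggest a grouping that may be an artifact of the particular sample.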

Presentation: https://www.youtube.com/watch?v=Mx-k-pyx_m4

Class of 2021

The Demographics of Unemployment

By: Alison Delano, G'21, Generalist Program

This project utilizes time-series analysis to compare changes in unemployment percentages by education level. The data came from the US Bureau of Labor Statistics. I was interested in how a fluctuation in the economy would affect unemployment. I worked with unemployment data for twenty-one years, from January 2000 to December 2020, for ages twenty-five and up. I compared unemployment data for five education levels: less than a high school diploma, a high school diploma, bachelor’s, master’s, and doctoral degrees. The data was divided up by year and month, and there were 252 observations per education level. I was interested in comparing the change in unemployment by education level and also forecasting the unemployment percentage. During this period, there was a recession and the start of a pandemic.

Presentation: https://www.youtube.com/watch?v=y8096xZZNv0&feature=youtu.be

The Price of Admission

By: Emily Leech, G'21, Generalist Program

Private school admissions have recently been a topic of hot debate in the media. While under a microscope due to less than ethical choices made by select individuals in the college community, independent school admissions truly are a gamble for all involved. The ability to predict who will submit an application and, in turn, who will accept an offer of admission is a common topic in conversations with colleagues, families, consultants, and other schools. By leveraging data mining techniques, data science may provide some insights into the new age of admissions. Knowing where a school should strategically target campaigning efforts when budgets are limited and offices are small is essential to a school’s success. In this work, I will utilize admissions survey data collected from applicants, whose trends change annually. The data is for Suffield Academy in the year 2020-2021. I will discuss the concepts of both classification and random forests, and how these methods may aid in better forecasting admissions applications and perhaps yield each year. Results based on variable importance will be used to prioritize important variables in the dataset. Through this presentation, I will also discuss class imbalance, a common problem when applying algorithms to datasets, and how to overcome it with down-sampling.
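
Down-sampling, as mentioned above, can be sketched in Python as follows. This is an illustrative sketch rather than the project's implementation. The function name, the `label_of` accessor and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def downsample(rows, label_of, seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))  # sample without replacement
    rng.shuffle(balanced)
    return balanced
```

The balanced set discards some majority-class rows, which trades information for a classifier that is less inclined to predict the majority outcome for every applicant.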

Presentation: https://www.youtube.com/watch?v=oZkj3ZaE_vs&feature=youtu.be

Consumer Expenditure and the Stock Market: Time Series Analysis Predictions

By: Dane Martin, G'21, Generalist Program

This study looks at the Dow Jones Industrial Average (DJIA) return over a 20-year period using monthly data available on many financial sites, specifically Yahoo Finance. Forecasting models are well suited for modeling data of this type because they leverage observed changes and trends to make future predictions. The purpose of this study is to explore the outputs of different univariate models (the Naïve, Simple Exponential Smoothing, Holt-Winters, ARIMA and TBATS methods) and determine the best fit by calculating the Mean Absolute Percentage Error (MAPE) for each model. The MAPE is a measure of the prediction accuracy of a forecasting method; we ran it on all the univariate models of the DJIA and identified the best performing model by comparing the outputs. We also bring in economic indicator data for the Consumer Price Index (CPI) and use multivariate time series analysis, which involves more than one time-dependent variable, to compare the DJIA and CPI data, which show some dependency on each other. The vector autoregression (VAR) model is one of the most flexible and easy to use models for the analysis of multivariate time series. With these understandings, individuals can look at the seasonal variance and steady flow of any index to understand and decide whether to invest in the stock market.
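
To make the comparison metric concrete, the sketch below computes MAPE and pairs it with the simplest of the listed methods, the naïve forecast, which repeats the last observed value. The function names are illustrative and no DJIA data is included.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


def naive_forecast(history, horizon):
    """Naïve method: every future value equals the last observed value."""
    return [history[-1]] * horizon
```

Computing `mape` on a held-out period for each candidate model, then choosing the model with the lowest value, mirrors the selection procedure the abstract describes.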

Presentation: https://www.youtube.com/watch?v=VOGq8TL9IoQ&feature=youtu.be

Understanding the Demographics of Breastfeeding

By: Hope Shiryah Suplita, G'21, Generalist Program

There are multiple factors that impact how long a child is breastfed. This project tests two hypotheses to see if race, specifically, impacts breastfeeding duration. I will (1) determine if race plays a significant role in the length of time that a person breastfeeds, and (2) determine if, among those that breastfeed, there is a significant difference in the race of those that elect to do so for over six months. Data modifications were made to ensure the race categories could successfully be tested. To investigate these hypotheses, various testing and modeling concepts in R, including ANOVA, linear regression models and Fisher’s exact tests, were used to determine if race does indeed have an impact on breastfeeding duration. Results show that race plays a significant role in the various models considered. Understanding the impact of race on breastfeeding length may ultimately influence how breastfeeding resources, support and educational resources are distributed.
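
The project ran its tests in R. For readers who want to see what Fisher’s exact test computes, here is a minimal Python sketch for a 2x2 table. It is an illustration under assumptions: the function name is hypothetical, the two-sided p-value sums the probabilities of all tables no more likely than the observed one, and no project data is used.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_observed * (1 + 1e-9))
```

In the second hypothesis, the rows could be two race categories and the columns whether breastfeeding continued past six months. Tables with more than two categories need the generalized form of the test.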

Presentation: https://www.youtube.com/watch?v=zW-1hNACzA8&feature=youtu.be

Analyzing Churn for ITRAC LLC – A Dental Practice Solution Company

By: Matthew Reichart, G'21, Specialist Program

Itrac LLC is a dental solutions company which offers a product that helps dental practices set up an insurance alternative membership program. Like many companies, Itrac has data that is not fully utilized and could be better understood using data science techniques. For example, an exit survey of people leaving the membership program exists but is not streamlined. It was created with an option called “Other” which does not give direct information. This project utilizes data science methods such as classification, natural language processing and exploratory data analysis (EDA) to examine survey data and practice characteristics in an effort to better understand churn and high activity within the dental program.

Not all practices that use the membership program use it forever. When a large practice stops using the program, this creates revenue loss. By using EDA to inform feature engineering, a basis for whether a practice is canceled or not was established. From there, random forests were used to understand variable importance and the accuracy of predicting the status of a practice, the status being “Active”, “High Performing”, “Churned”, or “Future Churn”. These results were used to interpret the results of applying the Apriori algorithm, which aimed to create rules/flags for these statuses. Results indicate that the most important variable related to churn is the “Change from the Maximum”, and for high performers it is the “Patient Count”. With backwards selection, linear models were created for each. The analysis shows promising results and associations that may guide future investigations.
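
Rules of the kind Apriori produces are ranked by support and confidence. The sketch below computes both for a single candidate rule. The function names and the item labels used to call it are hypothetical, not values from the Itrac data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base
```

A rule such as "large drop from the maximum implies churned" would be kept as a flag only if both values clear thresholds chosen by the analyst.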

Presentation: https://www.youtube.com/watch?v=O0RJbKECgp4

Characterizing Unemployment in the US with Timeseries Analysis and Unsupervised Learning

By: Curtis Stone, G'21, Specialist Program

The unemployment rate in the US and its individual states is an important consideration for those in governmental leadership roles. However, it is highly volatile and connected to many other factors. In this project, I utilize a combination of timeseries analysis and unsupervised learning techniques to explore data and identify patterns of similarity and correlation between different industries, states, and the United States overall. I also seek to identify the variation explained within each topic of interest for further insights. Finally, various visualizations of the data are available and can be modified for easy consumption of the data. I demonstrate how these resources can be used to extract insights from many possible questions that can be asked. States that prove to be the most dissimilar are extracted and the reason for this behavior is determined, seasonality within industries is explored, variation within regions is identified, and how unemployment and the S&P 500 interact is demonstrated. This approach and accompanying tools can be utilized to identify patterns of risk during events and timeframes, and action may be taken to improve unemployment rates or assist those at highest risk.

Presentation: https://youtu.be/R1b7nGrhNRQ

Tableau (visualization tool showcased in the presentation):
https://public.tableau.com/profile/curtis.stone7796#!/vizhome/UnemploymentCapstone_16140304530220/USUnemployment?publish=yes

Link to Github (complete files and code used to create this project):
https://github.com/CurtisPS/Unemployment-in-the-US-Capstone

Presidential Inauguration: A Sentiment Analysis

By: Jameson Van Zetten, G'21, Specialist Program

The aim of this project is to develop a working tool that can ingest, pre-process, classify, and analyze the sentiment of live data related to the 2021 Presidential Inauguration from Twitter. This project utilizes Python for JSON data ingestion and interpretation from the Twitter API; the data is then stored in a MySQL database. This data is then cleaned and analyzed utilizing the TextBlob package in Python, which is based on the Natural Language Toolkit. This data is analyzed for two key metrics: Polarity and Subjectivity. Polarity assesses the overall emotion contained in the text, while Subjectivity assesses whether the statements are factual or based on emotion/opinion. This data was then related back to the overall topic of the tweet to help gain additional insight on the topic. After collecting and processing this data, Tableau was used to assist in the visualization and interpretation via a custom dashboard.

Keywords: Sentiment analysis, presidential inauguration, API, Python, TextBlob

Presentation: https://youtu.be/zTnDVN3PW24

Medical Endoscope Device Complaints

By: Paul Waterman, G'21, Specialist Program

Human emotions are enormously powerful. The best brands always connect with their consumers and appeal to human emotions. Sentiment analysis can be a powerful tool for enhancing not only customer experience but also brand management, marketing strategy, and new product development initiatives.

Customer complaints can provide a wealth of information. Often customer complaint descriptive history is simply ignored because of not knowing what to do with the narrative. Traditional statistical methods are often used to track the number of complaints without really understanding the words and emotions behind the complaint. We throw away complaint text data, but it can be a gold mine of valuable learnings and sentiments. The evolution of natural language processing (NLP) techniques allows us to dive into the complaint narrative and extract human emotion that could lead to a better understanding of customer perception towards an organization or product.

This presentation is an exploration of humanizing complaints related to medical endoscopes. The two sources of data used are the FDA Manufacturer and User Facility Device Experience (MAUDE) public data and an internal database maintained by the company. The data is subjected to NLP techniques such as stemming, lemmatizing and stop-word removal, followed by various clustering techniques and finally a sentiment analysis to draw out the human emotions behind the complaint.

Presentation: https://www.youtube.com/watch?v=1hcwMkRffYk&feature=youtu.be

Class of 2020

US Food Environments: Clustering and Analysis

By: Stephanie Cuningham, G'20

“Food environment” is the culmination of socioeconomic and physical factors which shape the food habits of a population. The USDA’s 2020 “Food Environment Atlas” (FEA) contains county-level data on 281 of these factors. A central theme in the discussion of food environment has been the idea of a “food desert,” generally defined as an area that is both low-income and low-access. However, recent research calls this into question, with claims that choices drive poor health outcomes, rather than lack of nutritional access. In this analysis, the FEA along with 2018 Census data on labor participation rates and educational attainment were examined to determine whether clustering areas by common food environment can reveal how underlying factors and health outcomes (here, diabetes rates) vary beyond food desert classification.

Bayesian Principal Component Analysis (PCA) and K-means clustering were performed with counties stratified by metro status. This yielded six clusters, which were plotted against the strongest loadings for each component to create biplots, and on choropleths. Clusters’ relationships to food desert status and diabetes rate were statistically analyzed. Random forests were run against these two outcomes to compare variable importance rankings. Findings showed that demographics (age, race, educational attainment) are more important to health outcome than to food desert status, and labor participation is more important to food desert status than to health outcome. SNAP benefit participation is important to both outcomes. Overall, results indicated that while food environment, food desert status, and diabetes rate are inter-related, certain factors associated with food desert status are more correlated with poor health outcomes than others. While poverty and access are important to food environment, this is not a simple causal relationship and further clustering work could reveal more personalized solutions for populations in need.

Watch the Presentation

Data Preprocessing for Non-profits

By: Josephine Hansen, G'20

For non-profits, data is essential for securing new grants and getting their support out to more individuals in need of help. The major issue with non-profit data is that it has not historically been a major priority. Consequently, the data is not clean and cannot be easily transferred into a new site that will help them keep track of their data, prioritize their findings, and track their outreach to individuals in need. Data pre-processing is very important but extremely time consuming. For a non-profit, this unfortunately equates to unnecessary expense associated with time expended cleaning the data files. This project addresses this issue through the use of a tool called Alteryx that is exceptional at saving time; in turn, this will help non-profits save money that would otherwise go towards data preprocessing and cleansing. This strategy allows new data to be inserted into pre-designed dataflows that within seconds perform the needed tasks so that the data is cleaned and ready to be migrated into their new site.

Keywords: non-profit, data processing, Alteryx, data integration

Watch the Presentation

Something of a Painter Myself

By: Khalil Parker, G'20

The goal of the project is to build a generative adversarial network that is able to create new instances of Monet art. This project is part of an ongoing Kaggle competition. The approach that I take relies on the use of a CycleGAN to transfer the style of a Monet onto a photorealistic image. CycleGAN is made up of four neural networks, two of which are generators and the other two are discriminators. The generators are used to create new images. The discriminators are used to tell if an image is real or fake. While training, the generators and discriminators are used against each other to increase the accuracy. If the discriminators learn how to detect the generated images, then the generators adjust their weights to create better images. If the generators learn how to fool the discriminators, then the discriminators adjust their weights to detect fake images better. This process continues until the generators can create images that look like Monet paintings. Two architectures, ResNet and U-Net, are considered, and results and comparisons are made.

Keywords: CycleGAN, GAN, Monet, Kaggle, Resnet, Unet

Watch the Presentation

Sentiment Analysis of Tweets Originating from Nigeria

Sentiment Analysis is the process of understanding the opinion of an author about a subject. For a successful sentiment analysis, the opinion (or emotion), the subject of discussion (what is being talked about), and the opinion holder are the main components required. For this project, a sentiment analysis was performed on data originating from Nigeria. Twitter data was extracted for the date range of October 4th - 12th. Over 1 million tweets were extracted during this period and analyzed. RStudio, Alteryx, SQLite and Tableau were used to extract, store, and transform the data as well as to perform the data visualization. Once the data was extracted, it had to be preprocessed and cleansed in order to strip out irrelevant words. Sentiment analysis was performed in R using 3 different sentiment lexicons: Bing (positive/negative sentiments), NRC (multiple emotions) and AFINN (polarities). The sentiment lexicons correctly and successfully picked up the negative sentiments arising from the recent and ongoing protests against police brutality in Nigeria. In addition to the sentiment analysis, the top 50 and 75 words were also mined from the data, with the most common words in the over one million tweets being “god” and “sars”, again successfully recognizing the key words from the protests.

Keywords: Twitter, Nigeria, sentiment analysis, topic analysis, dynamic sentiments

Watch the Presentation

Sentiment Analysis on Twitter – a generalizable approach with applications to COVID vaccines

By: Patrick Fitzgerald, G'20

Sentiment Analysis is the automated process of identifying and classifying subjective information in text data and sorting it into segments. Using sentiment analysis tools to analyze opinions on Twitter data can help to understand how people are talking and interacting online with a certain topic. It can help to better understand an audience, keep on top of what’s being said for a given topic, and discover new trends related to that topic. The most common type of sentiment analysis is ‘Polarity Detection’, which involves classifying a statement as ‘Positive’, ‘Negative’, or ‘Neutral’. Specifically, the TextBlob Python package was used to determine the Polarity, Subjectivity, and Sentiment of tweets pulled over a several week period relating to the ongoing development of a Coronavirus Vaccine. The Polarity gives a nuanced view of the sentiment of a statement, whereas the Subjectivity works to identify whether the statement is a fact or opinion. Then, the Sentiment is classified based on the Polarity of the given statement. Finally, an emphasis was placed on scalability and reproducibility. By introducing sentiment analysis tools into a workflow, unstructured tweet data was automatically pulled, cleaned, and classified in real-time, at scale, and accurately. This allows for a wide range of possible topics to be plugged into the workflow and does not limit the scope to just the single question posed in this project.

Keywords: Twitter, sentiment analysis, data processing, COVID

Watch the Presentation

Housing Price Prediction: App development

By: Robert Anzalone, G'20

This project aims to provide a predictor for the sale price of a single-family home that is generalized to markets within the United States. Models leverage postal code geospatial information to capture the necessary granularity of the markets. The intent of the resulting application is not to provide the most realistic and precise prediction, but rather to allow users to form a “ballpark” estimate, as well as to explore the effects of the input features on the projected sales prices from market to market. In the event that the home has not been sold before, or if it is located in an area where few other comparable homes have sold, a generalized model of this type is well-suited to provide a “baseline” home value.

Keywords: Housing, prediction, API, data integration, geospatial data, Shiny

Watch the Presentation

Housing Prediction App

Please visit the MS in Applied Data Science program page to learn about the two tracks, curriculum, faculty, program options, and more!