Applied Data Science (MS) Student Capstone Projects
Case Analysis Capstone (ADS670) aims to develop both technical and soft skills that are not directly taught in the traditional courses in the program, but are relevant and critical in order to develop, innovate and communicate in modern data science. This is a project-oriented capstone that will harness the skills gained throughout the program.
Below are some examples of original research studies done by students in our master's in Applied Data Science program for their completed capstone projects.
Predicting Store Weekly Sales: A Case Study of Walmart Historical Sales Data
Stephen Boadi G'23
This capstone project aimed to develop a predictive model for weekly sales of different Walmart stores. The dataset was provided by Walmart as part of a Kaggle competition and contained various features including the store size, location, type, and economic indicators. The project used different regression models, including linear regression, random forest regression, and gradient boosting regression, to predict weekly sales.
After data exploration, preprocessing, and feature engineering, the models were trained and evaluated using the training and validation data. Hyperparameter tuning and feature importance analysis were used to improve the performance of the models. The final best model was selected based on its validation error and compared to the top score in the Kaggle competition leaderboard.
The results showed that the Random Forest Regressor had the lowest validation error and was chosen as the best model because it showed strong predictive power for the problem. The key features that influenced the weekly sales were store size, store type, and department within the store. The model was then used to predict the weekly sales for the test data, and its results were compared with the Kaggle competition leaderboard.
Overall, the project demonstrated how data cleaning, regression modeling, hyperparameter tuning, and feature importance analysis can be combined to develop a predictive model for weekly sales. The final model showed promising results, but further improvements could be made with more data and feature engineering.
View the Presentation on YouTube
Predicting Burnout: A Workplace Calculator
Jill Anderson G'23
Is it possible to predict and ultimately prevent burnout? When the pandemic began, employers moved their employees to work from home, where possible. More than two years later, many of these employees have not returned to the office. However, some employees, including myself, prefer to work in an office environment. I hypothesize this may be associated with burnout. HackerEarth hosted a competition that ran from October 20 to November 19, 2020. The competition dataset was also used to create a burnout quiz. I took this quiz several times to see how I scored. One of my attempts resulted in a lower score when I increased one of the attributes, how busy I consider myself. Could I create a better model than the survey? Results show that feature selection and regression modeling are efficient for predicting burnout. A predictive model of this type could guide employees and employers and minimize burnout. For example, when an employee approaches a score close to burnout level, they and their manager can have a discussion. This discussion could result in changes that lower the employee’s score before they burn out.
View the Presentation on YouTube
Beyond Artist: Text-to-Image AI
Sarah Snyder G'23
Text-to-image AI is new software that turns a text prompt into stunning images. The prompt can be long or short, detailed or simple, with output that can be in any style or medium the user could imagine. Painting, photography, sculpture, architecture, industrial design, fashion design and more: this new software can do it all with a realism that is oftentimes indistinguishable from the real thing. The images are so convincing that even experts in their respective fields have been fooled when asked to distinguish real work from AI-created work.
Below is a series of six images: one of the six is real and selling for $36,000 at Sotheby’s Auction House; the rest are AI. Can you spot the real painting?
With most of these programs now widely accessible to the public, the art world has been disrupted like never before. What it means to be an artist has been left in free fall as the world decides what the definition of art is. Ethical outrage has erupted over the discovery that these companies are using datasets of images scraped from the internet containing the intellectual property from countless people without their consent or knowledge. Many artists are facing the realistic future of being replaced by AI, while others embrace this new technology.
Follow me as I present the magnitude of these advances, how the software works, its uses and applications, the controversy, ethics, historical context and more through a captivating one-hour presentation that takes us beyond the concept of the artist and into the uncharted territory of text-to-image AI.
View the Presentation on YouTube
Predicting the Assessment Value of a Home
Corey Clark G'23
Countless tools have been developed for predicting the sale price of a house; however, predicting the assessment value is a topic yet to be explored thoroughly. Our analysis focused on determining the most important factors for predicting the assessed value of a home, then comparing them with the factors predicting the sale price. Our dataset consisted of parcel information from the municipal office of Warwick, RI, which was exported from the Vision Government Solutions software using an interface called the Warren Group Extract. We limited our selection to residential properties with a sale price of over $10,000. Using Lasso and Random Forest regression, we weighted the importance of each feature for predicting both the sale price and the assessed value. The Lasso models were more inconsistent than the Random Forest models. On average, the predicted total assessment using Random Forest was $3,575 less than the actual value. By comparison, the average predicted sale price using Random Forest was $4,250 less than the actual value. Both predictions are within an acceptable range. For both the sale price and total assessed value Random Forest models, the effective area of the house was the most critical factor. However, as expected, comparing the models revealed some differences in variable importance. In the assessed value prediction, the Random Forest model identified grade as more important than total acreage, whereas in the sale price prediction, total acreage had greater importance than grade. Leveraging this model gives municipalities an alternative way to identify the prioritized features stored in their database and helps determine the correct assessed value of their properties with more confidence.
View the Presentation on YouTube
Best In Class Regression: An Analysis of Car Prices
Daniel Duffee G'23
Across the world, cars are used as a means of transportation, and demand for them is huge, with more than 60 automotive manufacturers globally. With so many different models and available features, this raises the question: “What is most important in determining the price of a car?” This project constructed a model through regression analysis to answer this question. The study used a dataset from Kaggle.com that covers car prices in the American marketplace across a variety of brands. It consists of 205 entries and contains 25 features describing the car’s size, performance, engine and more. Multivariate linear regression with recursive feature elimination and random forests were used to craft an effective pricing model. Recursive feature elimination was used to reduce the original dataset to a simpler model, whose adjusted R-squared value was then tested to see how the model performed. Random forests were then used to tackle issues of multicollinearity, with RMSE used to evaluate the new model’s performance. It was found that the most important factors in determining the price of a car were “carwidth”, “curbweight”, “enginesize”, “horsepower”, and “highwaympg”.
View the Presentation on YouTube
The Customer is Always Right: Leveraging Social Media for Customer Feedback
James Kleine-Kracht G'23
The saying goes that "Sales are the voice of the customer," so why not try to leverage their actual voice? In today's world customers are constantly discussing products and services through social media. This project aims to use tweets with specific keywords and apply text mining software to discover real customer feedback. This project focuses on Disney Parks, Experiences, and Products to compare sentiment across multiple avenues. Looking at nearly 700,000 tweets across 27 days, we can compare topics such as Marvel vs. Star Wars and Disney World vs. Disneyland, as well as look at specific topics like Disney Plus or Disney Vacation Club.
View the Presentation on YouTube
Autonomous Maze Solving with a ‘Create2’ Robot
Kyle Kracht G'23
Machine vision can be used to guide a robot through a maze. For the robot to successfully navigate the maze, it must know what turns are available to it at any given time, how to activate its motors to move through the maze, and a strategy for what choices to make. To be interpreted, the visual data must be processed through a convolutional neural network in real time. That necessitates a network architecture that uses very little RAM and computational power. Additionally, the network must have very high accuracy, as it will have to make the correct classification many times in a row. The robot must be given explicit directions to execute its maneuvers in the physical space. This is accomplished by writing Python code that sends a velocity value to each motor, a command to wait for a certain period, and then a command to stop. To turn a certain number of degrees, the encoders are polled, and the motors are stopped once a certain difference in encoder counts is read. Finally, the robot requires an algorithmic way to approach its decision making. Thankfully, there exists an algorithm for solving any geometrically “simply connected” maze, i.e. one in which the internal walls all connect back to the exterior.
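The abstract does not name the maze-solving algorithm. A common choice for simply connected mazes is wall following (the right-hand rule), so the sketch below assumes that strategy. The grid layout, the `MAZE` constant, and the function names are hypothetical and stand in for the robot's camera and motor code.

```python
# Hypothetical grid maze: '#' is a wall, 'S' the start, 'E' the exit.
# All interior walls connect to the outer boundary (simply connected).
MAZE = [
    "#######",
    "#S    #",
    "# ### #",
    "#   #E#",
    "#######",
]

# Headings in clockwise order: north, east, south, west.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]


def find(maze, ch):
    """Return the (row, col) position of character ch in the maze."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")


def wall_follow(maze, heading=1, max_steps=10_000):
    """Follow the right-hand wall from 'S' until 'E' is reached."""
    pos, goal = find(maze, "S"), find(maze, "E")
    path = [pos]
    for _ in range(max_steps):
        if pos == goal:
            return path
        # Preference order: turn right, go straight, turn left, turn back.
        for turn in (1, 0, -1, 2):
            h = (heading + turn) % 4
            r, c = pos[0] + DIRS[h][0], pos[1] + DIRS[h][1]
            if maze[r][c] != "#":
                pos, heading = (r, c), h
                path.append(pos)
                break
    raise RuntimeError("exit not reached within max_steps")


path = wall_follow(MAZE)
print(len(path) - 1, "moves, ending at", path[-1])
```

On a physical robot, the "is this cell open" check would come from the image classifier, and each move would translate into the timed motor commands described above.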
View the Presentation on YouTube
Text-mining a decade of conflicts in Africa
Frankline Owino G'23
Violent conflicts have beset Africa for decades, even after most countries achieved independence. Between 1997 and 2017, there were 96,717 violent conflicts, and these are only the recorded ones. The violent conflicts in Africa are primarily of three kinds: violence against civilians, riots and protests, and military battles. Between 2012 and 2017 there was a steady increase in violent conflicts. In fact, the rate of increase is likely higher than reported, since the areas where many of these conflicts took place are inaccessible.
This project examines unstructured text data from ACLED and Twitter. The data is used to examine associations between violence and factors such as economic prosperity. Sentiment analysis and supervised learning methods are used to probe issues around the violence and assess motivations that are political or resource-based. Sentiment analysis showed that there was a significantly higher number of negative sentiments over the years. The words killed, armed, and police are among those that featured prominently, while the word peace was mentioned only 93 times in 23 years. Predictive models, including Support Vector Machines, Random Forests, Boosting, and kNN, were built to predict fatalities and showed at most a 54% accuracy level. A simple binary representation of fatalities outperformed all other models considered, and although performance was not outstanding, it was found to be better than random.
View the Presentation on YouTube
Robotic Process Automation
Ivette Lugo, G'22
Robotic Process Automation (RPA) refers to the automation of tedious, high-volume, repetitive human tasks. RPA does not involve any form of physical robot; instead, software robots mimic human behavior by interacting with desktop applications and online systems in the same way that a person does. This automation allows employees to concentrate on higher-value tasks and brings higher job satisfaction. Many companies offer automation software packages, but UiPath has proven to be the market leader in the RPA world. UiPath can accommodate small companies as well as large corporations. UiPath also offers a free Community Edition and free courses (https://academy.uipath.com) for those interested in learning and becoming certified as a Developer in RPA. Of course, as with anything, RPA has many benefits and a few drawbacks, but the benefits far surpass any shortcomings. We will examine three different bots created specifically for this project. The robots perform different human tasks, including browsing the internet, scraping weather and real estate data, and creating and sending emails. These provide a sense of the many things that can be accomplished by this technology in the workplace.
Determining Predictors in Student Graduation Rates in U.S. Post-Secondary Education
Anastasia Tabasco-Flores, G'22
This analysis of graduation rates and the percent of financial need met at public and private bachelor’s programs uses predictive models to determine potential net cost and likelihood of graduation at post-secondary programs. The study considers the graduation rate within six years at bachelor’s programs in the United States, with common features such as gender and race/ethnicity and added specialized considerations such as the percent of financial need met, obtained through the National Center for Education Statistics and the College Board, respectively. The purpose of the study is to use predictive modeling to determine the feature importance of these variables and to further understand the factors that contribute to successful graduation within six years of attending a bachelor’s program. Regression trees, random forests, and partial dependence plots (PDPs) were used to build these models and develop clear plots for model interpretation. Random forests and PDPs were used to analyze the data in a clean and concise manner, and to further understand the correlations and dependencies among independent variables in order to find a best-fit model for the years 2011-2020. With this analysis, colleges would be able to better understand the implications of the percent of financial need met and its connection to graduation rates, informing policies that are supportive of a diverse student population as well as institutional policies around access and persistence. Furthermore, students and families will be able to use this information to make informed decisions about which post-secondary program would be the best fit for them, and hopefully it will give students and families tools to access professional education while creating generational wealth.
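As a rough illustration of what a PDP computes, the sketch below averages a model's predictions over the data while one feature is forced to each value on a grid. The `toy_model` function, its coefficients, the feature names, and the rows are all invented for illustration; they are not the study's model or data.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` set to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


# Hypothetical stand-in for a fitted model (coefficients are invented).
def toy_model(row):
    return 40.0 + 0.3 * row["pct_need_met"] + 5.0 * row["is_private"]


# Invented example rows, not real institutional data.
rows = [
    {"pct_need_met": 60.0, "is_private": 0},
    {"pct_need_met": 80.0, "is_private": 1},
    {"pct_need_met": 95.0, "is_private": 1},
    {"pct_need_met": 70.0, "is_private": 0},
]

grid = [50.0, 75.0, 100.0]
curve = partial_dependence(toy_model, rows, "pct_need_met", grid)
for value, avg in zip(grid, curve):
    print(f"pct_need_met={value:5.1f} -> average predicted rate {avg:.1f}")
```

Plotting `curve` against `grid` gives the partial dependence plot for that feature; with a random forest, `predict` would be the fitted forest's prediction function.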
Intimate Partner Violence During the COVID-19 Pandemic in Canada
Sanzhar Baiseitov, G'22
In 2020, the COVID-19 pandemic brought economic and psychological stress to families. These factors are known to be associated with the incidence of intimate partner violence. Intimate partner violence has two subclasses: Domestic and Sexual Violence. Sexual violence is often treated as a subset of domestic violence. During the COVID-19 pandemic, an increase in Domestic Violence police reports was registered. However, not all cases are reported to police, and victims also seek help informally, for example through internet searches. In this study, Domestic and Sexual violence were treated as separate types. For each of the violence types, two questions were asked: “Did informal help seeking change during the COVID-19 pandemic?” and “What are important factors in predicting help-seeking behavior?” Google Trends data from Canada for the years 2017-2020 was used to perform modeling. Univariate and multivariate time series models, as well as the difference-in-differences method, were used to answer the questions of the study. An exponential smoothing univariate time series model was used to create a forecast and compare it with the actual data. The change in the help-seeking trend before and after the pandemic was analyzed using the difference-in-differences method. Several candidate explanatory variables were included in the multiple linear regression model, and a search for the most accurate model was performed. It was found that the informal Domestic Violence help-seeking trend had no change in 2020 under the COVID-19 pandemic conditions. In contrast, the informal Sexual Violence help-seeking trend was different in 2020.
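The basic difference-in-differences estimate can be sketched as follows. The abstract does not say how the comparison group was defined, so the "treated" and "control" series here are assumptions, and every number is invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus the change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented search-interest indices (not Google Trends data).
treated_pre = [50.0, 52.0, 48.0]    # series of interest, before the pandemic
treated_post = [60.0, 62.0, 58.0]   # series of interest, during the pandemic
control_pre = [40.0, 42.0, 38.0]    # comparison series, before
control_post = [44.0, 46.0, 42.0]   # comparison series, during

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"difference-in-differences estimate: {effect:+.1f}")
```

The estimate attributes to the event only the part of the change that the comparison series does not share, which relies on the two series following parallel trends otherwise.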
Miami Dolphins Sentiment Analysis
By: Chelsea Ripoll, G'22, Generalist Track
This project aims to understand the sentiment of tweets about the Miami Dolphins football team. Twitter data was extracted for 5 consecutive games played between September 19th and October 17th. About 10-15 thousand tweets from each week were extracted using specific keywords related to the Miami Dolphins, for a total of about 50-75 thousand tweets. A Twitter developer account was created to connect to the Twitter API using Google Colab to retrieve the data. The extracted data from each week was then stored in a CSV file. The data was then cleaned in Jupyter notebooks using pre-processing steps such as removing unnecessary characters, removing stopwords, applying lemmatization and normalizing the data. Once the data was cleaned, a word cloud was created to show the top 50 words used in each game. In addition, a list of the top 20 common words and their frequencies was compiled. Sentiment analysis was performed on the datasets from each week using VADER. SentimentIntensityAnalyzer was imported to calculate the sentiment of each tweet. Sentiment analysis is the process of determining whether a word or phrase is positive, negative or neutral. It is often performed on textual data to help companies monitor brand and product sentiment in customer feedback, understand customer needs and gauge brand reputation. Lately it has become important for companies to analyze what their customers are saying about their brand or product on social media. VADER is a rule-based model that handles social media text well because it can provide sentiment for emoticons, slang words and acronyms. The polarity score function of VADER labels words or phrases as “positive”, “negative” or “neutral”. The model creates a compound score, a float between -1 and 1 that reflects sentiment strength based on the input text. For this project all neutral tweets were ignored. Multiple bins were created based on the compound score distribution: positive, slightly positive, negative and slightly negative.
Sentiment analysis across all 5 games showed a range of 52-58% positive and 42-48% negative tweets. Even though the Miami Dolphins lost 5 consecutive games, most of the tweets were still positive.
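The binning step can be sketched in plain Python once each tweet has a compound score. The ±0.05 neutral band follows the convention in VADER's documentation; the 0.5 cut between "slightly" and fully positive or negative is an assumption, since the abstract does not give the project's thresholds. The example scores are invented.

```python
from collections import Counter


def bin_compound(score, strong=0.5, neutral=0.05):
    """Map a compound score in [-1, 1] to a sentiment bin.

    `neutral` follows VADER's usual +/-0.05 convention; `strong` is an
    assumed cut-off, not the one used in the project.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if score >= strong:
        return "positive"
    if score >= neutral:
        return "slightly positive"
    if score > -neutral:
        return "neutral"
    if score > -strong:
        return "slightly negative"
    return "negative"


# Invented compound scores standing in for analyzer output.
scores = [0.8, 0.2, 0.0, -0.3, -0.9, 0.6]
labels = [bin_compound(s) for s in scores]

# Neutral tweets are dropped before computing shares, as in the project.
kept = [label for label in labels if label != "neutral"]
counts = Counter(kept)
positive_share = (counts["positive"] + counts["slightly positive"]) / len(kept)
print(counts, f"positive share: {positive_share:.0%}")
```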
Improving Digital Criteria to Aid in Enrollment & Surveillance for a Process Improvement Program in a Healthcare System
By: Omar Badran, G'22, Specialist Track
Electronic medical records (EMR) contain copious amounts of data that are collected during a patient’s hospital visit. During hospital visits there are processes in place to ensure the patient is receiving adequate care and that it is delivered efficiently. Structured family communication programs are in place to provide support and resources to families of patients who present with more severe illness and risk. The criteria used for admission into this program can be viewed as subjective, and they underperform by either missing patients who should be in the program or using resources on patients who do not belong in it. The objective of this work is to leverage machine learning approaches in connection with EMR data to automate and improve admission into surveillance programs. Results show that EMR data combined with machine learning techniques drastically improve sensitivity and specificity, by 22% and 45% respectively, compared with the existing program. Moreover, the set of variables for classifying a patient into the program is shown to vary and depend critically on the days of stay. An automated machine learning approach of this type can guide clinical providers in identifying patients for structured family communication programs and assist hospital leadership with program surveillance across the healthcare system.
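For readers unfamiliar with the two metrics, the sketch below computes sensitivity and specificity from binary labels. The labels are invented and only illustrate the definitions; they are not the study's results.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = enroll)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Sensitivity: share of patients who belong in the program that are caught.
    # Specificity: share of patients who do not belong that are left out.
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels: 1 means the patient should be (or was) enrolled.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Low sensitivity corresponds to missing patients who should be in the program; low specificity corresponds to spending resources on patients who do not belong in it.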
A Dive into Reinforcement Learning
By: Evan Reeves, G'22, Specialist Track
Reinforcement learning is one of the three basic types of machine learning, in which agents learn how to perform in an environment via trial and error. This project takes an introductory look at the topic, explores its intricacies, and concludes with an implementation for an inverted double pendulum. Along the way, three important algorithms are explored (Q-Learning, Deep Q-Learning, and Policy Gradients), as well as some real-world use cases for reinforcement learning. The methods are applied to the double pendulum problem to demonstrate the training process.
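A minimal sketch of tabular Q-learning, the simplest of the three algorithms, is shown below. The environment is a hypothetical five-state corridor rather than the double pendulum, and the hyperparameters are arbitrary. The agent explores with a uniformly random policy, which is permitted because Q-learning is off-policy.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, reward at state 4
ACTIONS = (0, 1)        # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # random exploration (off-policy)
            nxt, reward, done = step(state, action)
            # Q-learning update: move toward reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q = train()
# Greedy policy read off the learned table for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print("greedy action per state (1 = right):", policy)
```

Deep Q-Learning replaces the table `q` with a neural network, which is what makes continuous problems such as the pendulum tractable.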
COVID-19 Interventions and Online Learning Engagement in the Northeast
By: Jonathon Royce, G'22, Specialist Track
COVID-19 and the havoc it wreaked on the world in 2020 need no introduction. Countries locked down, businesses closed their doors, and people were mandated to stay at home. Globally, some 219 million cases of COVID were contracted, with about 4.55 million deaths. The impact on the education of school-aged children was universally detrimental, though the degree of detriment varied across counties and states. Understanding how to maximize educational engagement during pandemics can enable preparedness that may improve learning outcomes for future pandemics, or disasters, that would move the traditional school setting to an online venue. In this presentation, we’ll examine how the United States’ efforts in 2020 to combat COVID-19 affected engagement in online learning. The focus will be primarily on the northeastern states. Exploratory data analysis and statistical models will be used to identify the critical factors that play a role in estimating online engagement.
March Machine Learning Mania
By: Nick Tornatore, G'22, Specialist Track
Dating back to 1939, NCAA Men’s Division I basketball’s annual champion has been decided through a tournament following the conclusion of the regular season. Since 1985, the annual tournament has included 64 teams, giving rise to the modern brackets so closely associated with the tradition famously called “March Madness”. Each year an estimated 70 million brackets are submitted to a variety of bracket contests, all with dreams of predicting every step of the tournament correctly. These dreams are promptly crushed, as the probability of ever devising a perfect bracket is so infinitesimally low that Warren Buffett has famously offered a $1 billion prize on occasion for anyone who could accomplish the feat. To date, none have come close to succeeding. As part of the cultural phenomenon that is March Madness, Kaggle hosts an annual competition to create a statistical model that best predicts the outcomes of the annual basketball tournament, but with a slight twist in measuring success compared to the traditional bracket competitions. This project employs a variety of modeling approaches with unique perspectives on evaluating the strength of a team as the tournament begins. Model performance is benchmarked and reviewed against historical tournaments and their respective Kaggle leaderboards. Additionally, strong consideration is given to how to better refine the analysis for future tournaments, embodying the spirit of March Madness in striving to eventually be declared champion.
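To put "infinitesimally low" in numbers: a 64-team single-elimination bracket has 63 games, so a bracket filled in by coin flip is perfect with probability (1/2)^63. The coin-flip assumption and the alternative per-game accuracy below are illustrative choices, not figures from the project.

```python
from fractions import Fraction

TEAMS = 64
GAMES = TEAMS - 1  # single elimination: every game eliminates exactly one team

# Assumption: each game is an independent coin flip.
p_perfect = Fraction(1, 2) ** GAMES
print(f"1 in {p_perfect.denominator:,}")

# Assumption: an informed picker who calls each game correctly 70% of the time.
p_informed = 0.7 ** GAMES
print(f"informed picker: about {p_informed:.2e}")
```

Even a large improvement in per-game accuracy leaves a perfect bracket extremely unlikely, because the probability compounds over all 63 games.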
Introduction to Cluster Stability Using the Clusterboot Function
By: Michael Tucker, G'22, Specialist Track
Clustering data is a challenging problem in unsupervised learning where there is no gold standard. The selection of a clustering method, measures of dissimilarity, and parameters, and the determination of the number of reliable groupings, are often viewed as a subjective process. Stability has become a valuable surrogate for performance and robustness that is useful for guiding an investigator to reliable and reproducible clusters. This project walks through some of the core concepts of stability and demonstrates them in the R programming language with the “clusterboot” function.
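The idea behind bootstrap-based stability can be sketched outside R as well. The Python example below is a simplified illustration, not a port of clusterboot (from the R package fpc): it resamples the data, re-clusters each resample, and records for every original cluster the best Jaccard similarity to a cluster in the resample. The one-dimensional "largest gap" clustering and the data values are invented to keep the example self-contained.

```python
import random


def split_at_largest_gap(points):
    """Toy 1-D clustering: cut the sorted values at their widest gap.

    `points` maps point index -> value; returns a list of index sets.
    """
    order = sorted(points, key=points.get)
    if len(order) < 2:
        return [set(order)]
    gaps = [points[order[i + 1]] - points[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [set(order[:cut]), set(order[cut:])]


def jaccard(a, b):
    return len(a & b) / len(a | b)


def bootstrap_stability(values, n_boot=100, seed=0):
    """Mean best-match Jaccard similarity for each original cluster."""
    rng = random.Random(seed)
    points = dict(enumerate(values))
    original = split_at_largest_gap(points)
    totals = [0.0] * len(original)
    counts = [0] * len(original)
    for _ in range(n_boot):
        # Bootstrap: draw indices with replacement, keep the distinct ones.
        sampled = {rng.randrange(len(values)) for _ in values}
        boot = split_at_largest_gap({i: points[i] for i in sampled})
        for k, cluster in enumerate(original):
            present = cluster & sampled
            if not present:
                continue  # this cluster is absent from the resample
            totals[k] += max(jaccard(present, b) for b in boot)
            counts[k] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# Invented data with two well-separated groups.
values = [1.0, 1.1, 0.9, 1.2, 1.05, 10.0, 10.1, 9.9, 10.2, 10.05]
stability = bootstrap_stability(values)
print([round(s, 3) for s in stability])
```

Values near 1 indicate a cluster that reappears in almost every resample; clusters that dissolve under resampling score much lower.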
The Demographics of Unemployment
By: Alison Delano, G'21, Generalist Program
This project utilizes time-series analysis to compare changes in unemployment percentages by education level. The data came from the US Bureau of Labor Statistics. I was interested in how a fluctuation in the economy would affect unemployment. I worked with unemployment data for twenty years, from January 2000 to December 2020, for ages twenty-five and up. I compared unemployment data for five education levels: less than a high school diploma, high school diploma, bachelor's, master's, and doctoral. The data was divided up by year and month, and there were 252 observations per education level. I was interested in comparing the change in unemployment by education level and also in forecasting the unemployment percentage. During these twenty years, there was a recession and the start of a pandemic.
The Price of Admission
By: Emily Leech, G'21, Generalist Program
Private school admissions have recently been a topic of hot debate in the media. While admissions are under a microscope due to less than ethical choices made by select individuals in the college community, independent school admissions truly are a gamble for all involved. The ability to predict who will submit an application, and in turn who will accept an offer of admission, is a common topic in conversations with colleagues, families, consultants, and other schools. By leveraging data mining techniques, data science may provide some insights into the new age of admissions. Knowing where to strategically target campaigning efforts when budgets are limited and offices are small is essential to a school’s success. In this work, I will utilize admissions survey data collected from applicants, whose trends change annually. The data is for Suffield Academy in the year 2020-2021. I will discuss the concepts of both classification and random forests, and how these methods may aid in better forecasting admissions applications and perhaps yield each year. Results based on variable importance will be used to prioritize important variables in the dataset. Through this presentation, I will also discuss the important problem of class imbalance, a common problem with datasets and the application of algorithms, and how to overcome it with down-sampling.
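Down-sampling can be sketched in a few lines: keep every row of the smallest class and draw an equally sized random sample from each larger class. The field names and the 90/10 split below are invented and do not reflect the Suffield Academy data.

```python
import random
from collections import Counter, defaultdict


def downsample(rows, label_key, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)  # avoid leaving the classes in blocks
    return balanced


# Invented imbalanced data: 10 "yes" rows and 90 "no" rows.
rows = [{"id": i, "enrolled": "yes" if i < 10 else "no"} for i in range(100)]
balanced = downsample(rows, "enrolled")
class_counts = Counter(row["enrolled"] for row in balanced)
print(class_counts)
```

The trade-off is that rows from the majority class are discarded, so down-sampling is usually applied to the training data only.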
Consumer Expenditure and the Stock Market: Time Series Analysis Predictions
By: Dane Martin, G'21, Generalist Program
This study looks at the Dow Jones Industrial Average (DJIA) return over a 20-year period, using actual monthly data available on many financial sites, specifically Yahoo Finance. Forecasting models are well suited for modeling data of this type because they leverage observed changes and trends, which enables us to make future predictions. The purpose of this study is to explore the outputs of different univariate models using the Naïve, Simple Exponential Smoothing, Holt-Winters, ARIMA and TBATS methods, and to determine the best fit by calculating the Mean Absolute Percentage Error (MAPE) for each model. The MAPE is a measure of the prediction accuracy of a forecasting method; we ran it on all the univariate models of the DJIA and identified the best performing model after comparing the outputs. We also bring in economic indicator data for the Consumer Price Index (CPI) and apply multivariate time series analysis, which involves more than one time-dependent variable, to compare the DJIA and CPI data, which show some dependency on each other. The vector autoregression (VAR) model is one of the most flexible and easy to use models for the analysis of multivariate time series. With these insights, individuals can look at the seasonal variance and steady flow of any index to understand the market and decide whether to invest.
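As a small illustration of how MAPE ranks competing forecasts, the sketch below scores one-step-ahead naive and simple exponential smoothing forecasts on an invented series. The values and the smoothing parameter are assumptions, not DJIA data or the study's settings.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


def naive_forecast(series):
    """One-step-ahead naive forecast: each value is predicted by the previous one."""
    return series[:-1]


def ses_forecast(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts for series[1:]."""
    level = series[0]
    forecasts = []
    for y in series[1:]:
        forecasts.append(level)
        level = alpha * y + (1 - alpha) * level
    return forecasts


# Invented monthly index values.
series = [100.0, 110.0, 105.0, 120.0]
actual = series[1:]

naive_mape = mape(actual, naive_forecast(series))
ses_mape = mape(actual, ses_forecast(series, alpha=0.5))
print(f"naive MAPE: {naive_mape:.2f}%  SES MAPE: {ses_mape:.2f}%")
```

With `alpha=1.0` the smoothing forecast reduces to the naive one, which makes a convenient sanity check when comparing methods.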
Understanding the Demographics of Breastfeeding
By: Hope Shiryah Suplita, G'21, Generalist Program
There are multiple factors that impact how long a child is breastfed. This project tests two hypotheses to see whether race, specifically, impacts breastfeeding duration. I will (1) determine if race plays a significant role in the length of time that a person breastfeeds, and (2) determine if, among those who breastfeed, there is a significant difference in the race of those who elect to do so for over six months. Data modifications were made to ensure the race categories could be tested successfully. To investigate these hypotheses, various testing and modeling methods in R, including ANOVA, linear regression models and Fisher’s exact tests, were used to determine if race does indeed have an impact on breastfeeding duration. Results show that race plays a significant role in the various models considered. Understanding the impact of race on breastfeeding length may ultimately influence how breastfeeding resources, support and educational resources are distributed.
Analyzing Churn for ITRAC LLC – A Dental Practice Solution Company
By: Matthew Reichart, G'21, Specialist Program
ITRAC LLC is a dental solutions company that offers a product to help dental practices set up an insurance-alternative membership program. Like many companies, ITRAC has data that is not fully utilized and could be better understood using data science techniques. For example, an exit survey of people leaving the membership program exists but is not streamlined. It was created with an option called “Other”, which does not give direct information. This project utilizes data science methods such as classification, natural language processing and exploratory data analysis to examine survey data and practice characteristics in an effort to better understand churn and high activity within the dental program.
Not all practices that use the membership program use it forever. When a large practice stops using the program, this creates revenue loss. By using EDA to inform feature engineering, a basis for whether a practice is canceled or not was established. From there, Random Forests were used to understand variable importance and the accuracy of predicting the status of a practice, the status being “Active”, “High Performing”, “Churned”, or “Future Churn”. These results were used to interpret the results of applying the APRIORI algorithm, which aimed to create rules/flags for these statuses. Results indicate that the most important variable related to Churn is the “Change from the Maximum”, and for High Performers it is the “Patient Count”. With backwards selection, linear models were created for each. The analysis shows promising results and associations that may guide future investigations.
Characterizing Unemployment in the US with Timeseries Analysis and Unsupervised Learning
By: Curtis Stone, G'21, Specialist Program
The unemployment rate in the US and its individual states is a very important thing to consider by those in governmental leadership roles. However, it is highly volatile and connected to many other factors. In this project, I utilize a combination of timeseries analysis and unsupervised learning techniques to explore data to identify patterns of similarity and correlation patterns between different industries, states, and the United States overall. I also seek to identify the variation explained within each topic of interest for further insights. Finally, various visualizations of the data are available and can be modified for easy consumption of the data. I demonstrate how these resources can be used to extract insights from many possible questions that can be asked. States that prove to be the most dissimilar are extracted and the reason for this behavior is determined, seasonality within industries is explored, variation within regions is identified, and how unemployment and the S&P 500 interact is demonstrated. This approach and accompanying tools can be utilized to identify patterns of risk during events and timeframes, and action may be taken to improve unemployment rates or assist those at highest risk.
Tableau (visualization tool showcased in the presentation): https://public.tableau.com/profile/curtis.stone7796#!/vizhome/UnemploymentCapstone_16140304530220/USUnemployment?publish=yes
Link to Github (complete files and code used to create this project): https://github.com/CurtisPS/Unemployment-in-the-US-Capstone
Presidential Inauguration: A Sentiment Analysis
By: Jameson Van Zetten, G'21, Specialist Program
The aim of this project is to develop a working tool that can ingest, pre-process, classify, and analyze the sentiment of live data related to the 2021 Presidential Inauguration from Twitter. This project utilizes Python for JSON data ingestion and interpretation from the Twitter API, which is then stored in a database using MySQL. This data is then cleaned and analyzed utilizing the TextBlob package in Python, which is based on the Natural Language Toolkit. This data is analyzed for two key metrics: Polarity and Subjectivity. Polarity assesses the overall emotion contained in the text, while Subjectivity assesses whether the statements are factual or based on emotion/opinion. This data was then related back to the overall topic of the tweet to help gain additional insight on the topic. After collecting and processing this data, Tableau was used to assist in the visualization and interpretation via a custom dashboard.
Keywords: Sentiment analysis, presidential inauguration, API, Python, TextBlob
Slides (with links): https://drive.google.com/file/d/15r8KGBRUnpZtcm_tEfuqaQEYnhd5Dy9/view?usp=sharing
Interactive Dashboard: https://public.tableau.com/shared/T3SHFHC84?:toolbar=n&:display_count=n&:origin=viz_share_link
Medical Endoscope Device Complaints
By: Paul Waterman, G'21, Specialist Program
Human emotions are enormously powerful. The best brands always connect with their consumers and appeal to human emotions. Sentiment analysis can be a powerful tool for enhancing not only customer experience but also brand management, marketing strategy, and new product development initiatives.
Customer complaints can provide a wealth of information. Often the descriptive history of customer complaints is simply ignored because it is unclear what to do with the narrative. Traditional statistical methods are often used to track the number of complaints without really understanding the words and emotions behind the complaint. We throw away complaint text data, but it can be a gold mine of valuable learnings and sentiments. The evolution of natural language processing (NLP) techniques allows us to dive into the complaint narrative and extract human emotion that could lead to a better understanding of customer perception towards an organization or product.
This presentation is an exploration of humanizing complaints related to medical endoscopes. Two sources of data are used: the FDA Manufacturer and User Facility Device Experience (MAUDE) public database and an internal database maintained by the company. The data is subjected to NLP techniques such as stemming, lemmatizing, and stop-word removal, followed by various clustering techniques and finally a sentiment analysis to draw out the human emotions behind the complaint.
US Food Environments: Clustering and Analysis
By: Stephanie Cuningham, G'20
“Food environment” is the culmination of socioeconomic and physical factors which shape the food habits of a population. The USDA’s 2020 “Food Environment Atlas” (FEA) contains county-level data on 281 of these factors. A central theme in the discussion of food environment has been the idea of a “food desert,” generally defined as an area that is both low-income and low-access. However, recent research calls this into question, with claims that choices, rather than lack of nutritional access, drive poor health outcomes. In this analysis, the FEA along with 2018 Census data on labor participation rates and educational attainment were examined to determine whether clustering areas by common food environment can reveal how underlying factors and health outcomes (here, diabetes rates) vary beyond food desert classification.
Bayesian Principal Component Analysis (PCA) and K-means clustering were performed with counties stratified by metro status. This yielded six clusters, which were plotted against the strongest loadings for each component to create biplots, and on choropleths. Clusters’ relationships to food desert status and diabetes rate were statistically analyzed. Random forests were run against these two outcomes to compare variable importance rankings. Findings showed that demographics (age, race, educational attainment) are more important to health outcome than to food desert status, and labor participation is more important to food desert status than to health outcome. SNAP benefit participation is important to both outcomes. Overall, results indicated that while food environment, food desert status, and diabetes rate are inter-related, certain factors associated with food desert status are more correlated with poor health outcomes than others. While poverty and access are important to food environment, this is not a simple causal relationship and further clustering work could reveal more personalized solutions for populations in need.
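The clustering step can be illustrated with a minimal K-means sketch in plain Python. It is not the project's code: the PCA step is skipped, the six points are made-up stand-ins for counties already projected onto two components, and the function name `kmeans` and its `initial_centroids` argument are hypothetical. The starting centroids are passed in explicitly so the run is reproducible.

```python
import math

def kmeans(points, initial_centroids, max_iterations=100):
    """Cluster points with plain K-means (Lloyd's algorithm)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the run has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # Leave an empty cluster's centroid where it is.
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Made-up 2-D scores standing in for counties projected onto two components.
counties = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(counties, initial_centroids=[counties[0], counties[3]])
print(labels)
```

In practice a library implementation with several random restarts would likely be preferable, since K-means results depend on the starting centroids.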
Data Preprocessing for Non-profits
By: Josephine Hansen, G'20
For non-profits, data is essential for securing new grants and extending support to more individuals in need of help. The major issue with non-profit data is that it has not historically been a major priority. Consequently, the data is not clean and cannot be easily transferred into a new site that will help them keep track of their data, prioritize their findings, and track their outreach to individuals in need. Data pre-processing is very important but extremely time consuming. For a non-profit, this unfortunately equates to unnecessary expense associated with time expended cleaning the data files. This project addresses this issue through the development of dataflows in Alteryx, a tool that is exceptional at saving time; in turn, this will help non-profits save money that would otherwise go towards data preprocessing and cleansing. This strategy allows new data to be inserted into pre-designed dataflows that within seconds will perform the needed tasks so that the data is cleaned and ready to be migrated into their new site.
Keywords: non-profit, data processing, Alteryx, data integration
Something of a Painter Myself
By: Khalil Parker, G'20
The goal of the project is to build a generative adversarial network that is able to create new instances of Monet art. This project is part of an ongoing Kaggle competition. The approach that I take relies on the use of a CycleGAN to transfer the style of a Monet onto a photorealistic image. CycleGAN is made up of four neural networks, two of which are generators and the other two are discriminators. The generators are used to create new images. The discriminators are used to tell if an image is real or fake. While training, the generators and discriminators are used against each other to increase the accuracy. If the discriminators learn how to detect the generated images, then the generators adjust their weights to create better images. If the generators learn how to fool the discriminators, then the discriminators adjust their weights to detect fake images better. This process continues until the generators can create images that look like Monet paintings. Two architectures, Resnet and Unet, are considered, and their results are compared.
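For reference, the adversarial game described above is commonly written as follows in the original CycleGAN formulation. The notation is an assumption introduced here, not taken from the project: G maps photos x to Monet-style images, F maps paintings y back to photos, D_X and D_Y are the two discriminators, and λ weights a cycle-consistency term that the abstract does not discuss. The generators try to minimize the total loss while the discriminators try to maximize it.

```latex
\begin{aligned}
\mathcal{L}_{\text{GAN}}(G, D_Y) &= \mathbb{E}_{y}\big[\log D_Y(y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D_Y(G(x))\big)\big] \\
\mathcal{L}_{\text{cyc}}(G, F) &= \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{\text{GAN}}(G, D_Y) + \mathcal{L}_{\text{GAN}}(F, D_X) + \lambda\, \mathcal{L}_{\text{cyc}}(G, F)
\end{aligned}
```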
Keywords: CycleGAN, GAN, Monet, Kaggle, Resnet, Unet
Sentiment Analysis of Tweets Originating from Nigeria
By: Olaleye Ajibola, G'20
Sentiment Analysis is the process of understanding the opinion of an author about a subject. For a successful sentiment analysis, the opinion (or emotion), the subject of discussion (what is being talked about), and the opinion holder are the main components required. For this project, a sentiment analysis was performed on data originating from Nigeria. Twitter data was extracted for the date range of October 4th - 12th. Over 1 million tweets were extracted during this period and analyzed. RStudio, Alteryx, SQLite and Tableau were used to extract, store, and transform the data as well as to perform the data visualization. Once the data was extracted, it had to be preprocessed and cleansed in order to strip out irrelevant words. Sentiment analysis was performed in R using three different sentiment lexicons: Bing (positive/negative sentiments), NRC (multiple emotions) and AFINN (polarities). The lexicons correctly and successfully picked up the negative sentiments arising from the recent and ongoing protests against police brutality in Nigeria. In addition to the sentiment analysis, the top 50 and 75 words were also mined from the data, with the most common words in the over one million tweets being “god” and “sars”, again successfully recognizing the key words from the protests.
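The word-mining step can be sketched roughly as below. This is an illustration only: the project used R, whereas this sketch uses Python's standard library, and the stop-word list, the `top_words` helper, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on", "rt"}

def top_words(tweets, n=50):
    """Return the n most common non-stop-word tokens across all tweets."""
    counts = Counter()
    for text in tweets:
        # Drop URLs and @mentions, then keep lowercase alphabetic tokens.
        text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        tokens = re.findall(r"[a-z']+", text)
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample tweets, for illustration only.
sample = [
    "RT @user: End SARS now! https://example.com",
    "God bless Nigeria. End SARS.",
    "Praying to God for the protesters",
]
print(top_words(sample, n=3))
```

Calling `top_words(tweets, n=50)` or `n=75` on the cleaned tweets would give the kind of ranked word lists described above.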
Keywords: Twitter, Nigeria, sentiment analysis, topic analysis, dynamic sentiments
Sentiment Analysis on Twitter – a generalizable approach with applications to COVID vaccines
By: Patrick Fitzgerald, G'20
Sentiment Analysis is the automated process of identifying and classifying subjective information in text data and sorting it into segments. Using sentiment analysis tools to analyze opinions on Twitter data can help to understand how people are talking and interacting online with a certain topic. It can help to better understand an audience, keep on top of what’s being said for a given topic, and discover new trends related to that topic. The most common type of sentiment analysis is ‘Polarity Detection’, which involves classifying a statement as ‘Positive’, ‘Negative’, or ‘Neutral’. In this project, the TextBlob Python package was used to determine the Polarity, Subjectivity, and Sentiment of tweets pulled over a several week period relating to the ongoing development of a Coronavirus Vaccine. The Polarity indicates a nuanced approach to viewing the sentiment of a statement, whereas the Subjectivity works to identify whether the statement is a fact or opinion. Then, the Sentiment is classified based on the Polarity of the given statement. Finally, an emphasis was placed on scalability and reproducibility. By introducing sentiment analysis tools into a workflow, unstructured tweet data was automatically pulled, cleaned, and classified in real-time, at scale, and accurately. This allows for a wide range of possible topics to be plugged into the workflow and does not limit the scope to just the single question posed in this project.
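The final classification step can be sketched as a threshold on the polarity score. TextBlob reports polarity in the range -1.0 to 1.0; the cutoffs used here, the `neutral_band` parameter, and the function name are assumptions for illustration, since the abstract does not state which thresholds the project used.

```python
def classify_sentiment(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    Scores within +/- neutral_band of zero count as Neutral (assumed rule).
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"

# Hypothetical polarity scores, as a sentiment library might return them.
for score in (0.4, -0.2, 0.0):
    print(score, classify_sentiment(score))
```

Widening `neutral_band` (for example to 0.1) treats weakly polarized tweets as Neutral, which may be useful when scores close to zero are noisy.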
Keywords: Twitter, sentiment analysis, data processing, COVID
Housing Price Prediction: App development
By: Robert Anzalone, G'20
This project aims to provide a predictor for the sale price of a single-family home that is generalized to markets within the United States. Models leverage postal code geospatial information to capture the necessary granularity of the markets. The intent of the resulting application is not to provide the most realistic and precise prediction, but rather to allow users to form a “ballpark” estimate, as well as to explore the effects of the input features on the projected sales prices from market to market. In the event that the home has not been sold before, or if it is located in an area where few other comparable homes have sold, a generalized model of this type is well-suited to provide a “baseline” home value.
Keywords: Housing, prediction, API, data integration, geospatial data, Shiny
Housing Prediction App: https://rtanzalone.shinyapps.io/housingpricepredictor/
Please visit the MS in Applied Data Science program page to learn about the two tracks, curriculum, faculty, program options, and more!