In my research series, I examine biases in large language models such as ChatGPT, particularly those related to the representation of countries and their associated cultures. Through this work, I have developed methodologies to measure and address these biases, aiming to minimize their potential negative impact in societal contexts. Additionally, I investigate how such skewed interpretations and biased text generation affect human readers who may not be aware of the source of the generated text. My ultimate goal is to establish a comprehensive framework that helps both developers and practitioners create more inclusive and equitable solutions when using large language models.
Theme: Natural Language Processing (NLP), Bias in AI, Auditing.
BITS is a test repository of 2,896 sentences curated to probe sentiment and toxicity analysis models for biases related to sociodemographic factors. The dataset is currently divided into three primary facets, each representing one sociodemographic factor: disability, race, and gender.
Each facet serves as a test to check whether sentiment or toxicity models are biased toward words that mention certain minority groups within a specific sociodemographic factor. With the help of these curated sentences, we can be more aware of the potential biases present in these language models before using them.
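As a rough illustration of how a probe of this kind might be run, the sketch below fills sentence templates with group terms and compares the mean model score of each group against a baseline group. The templates, the terms, and the `toy_score` function are hypothetical stand-ins for illustration; they are not sentences from BITS or a real sentiment model.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical templates and group terms, for illustration only;
# these are not taken from the BITS dataset.
TEMPLATES = ["I am a {term} person.", "My neighbor is a {term} person."]
GROUP_TERMS = {"baseline": ["tall", "young"], "disability": ["deaf", "blind"]}


def build_probes(templates: List[str],
                 group_terms: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Fill every template with every term, grouped by sociodemographic group."""
    return {
        group: [t.format(term=term) for t in templates for term in terms]
        for group, terms in group_terms.items()
    }


def score_gap(probes: Dict[str, List[str]],
              score: Callable[[str], float],
              reference: str = "baseline") -> Dict[str, float]:
    """Mean score of each group minus the mean score of the reference group."""
    means = {g: mean(score(s) for s in sents) for g, sents in probes.items()}
    return {g: m - means[reference] for g, m in means.items() if g != reference}


def toy_score(sentence: str) -> float:
    """Stand-in for a real sentiment model; it penalizes two words to simulate bias."""
    penalized = {"deaf", "blind"}
    words = sentence.lower().replace(".", "").split()
    return -1.0 if any(w in penalized for w in words) else 0.0


gaps = score_gap(build_probes(TEMPLATES, GROUP_TERMS), toy_score)
print(gaps)  # {'disability': -1.0}
```

A gap far from zero suggests that the model's score shifts when only the group term changes. In practice, `toy_score` would be replaced by a call to the sentiment or toxicity model under test.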
Theme: NLP, Bias in AI, Dataset
This project analyzes the behavior and patterns of broadcast police communications (BPC). Through qualitative and quantitative analysis, we explore how BPC change when social factors such as economic status, race, and age come into play. Machine learning methods, including supervised and unsupervised language models, are used to predict policing outcomes related to minority groups within various city zones in a state.
We also explore whether BPC contain potential breaches of PII (Personally Identifiable Information) that could be misused to identify specific groups or populations.
Theme: Social NLP, Thematic & Quantitative Analysis, Language Model Creation
This project shows how computational notebooks can be used to create interactive visualizations to learn fundamentals about Machine Learning and AI. We focus on the concepts of LDA (Latent Dirichlet Allocation) through this computational notebook.
The narration of the visualization leads into a discussion of how language models can be biased against minority populations. This is demonstrated using the same principles of LDA.
This project was presented at the 4th IEEE VIS Workshop on Visualization for AI Explainability (VisxAI 2021).
Theme: Data Visualization, Ethics in AI, Computational Notebook, NLP Bias, HCI
This design-centric data visualization project was developed to understand the context of loneliness and depression on social media. We scraped tweets related to depression and mental health and performed thematic analysis to understand the social definition of loneliness. By applying supervised NLP techniques, we provide better insight into the issue.
Team Members: Anjana Padmakumar, Pranav Venkit
Theme: Thematic and Discourse Analysis, Data Visualization, Application of NLP
This study focuses on negative online behavior, such as toxic comments. It was motivated by the Kaggle Toxicity Classification Challenge. We built a multi-headed model capable of detecting six different types of toxicity (threat, obscene, toxic, severe_toxic, insult, and identity_hate) to help improve online conversation. The model is an ensemble of various deep learning models with language embeddings.
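To make the multi-label ensemble idea concrete, here is a minimal sketch that averages per-label probabilities from several models and thresholds the result. Simple averaging and the 0.5 threshold are assumptions chosen for illustration; the description above does not specify how the ensemble combines its members, and the probability values are invented.

```python
from typing import List

# The six toxicity labels named in the project description.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def ensemble_average(model_outputs: List[List[float]]) -> List[float]:
    """Average the per-label probabilities across models (assumed ensembling rule)."""
    n_models = len(model_outputs)
    return [sum(probs[i] for probs in model_outputs) / n_models
            for i in range(len(LABELS))]


def predict_labels(probs: List[float], threshold: float = 0.5) -> List[str]:
    """Return every label whose averaged probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probs) if p >= threshold]


# Invented outputs of two hypothetical models for one comment.
outputs = [
    [0.9, 0.1, 0.8, 0.0, 0.6, 0.2],
    [0.7, 0.3, 0.6, 0.2, 0.2, 0.0],
]
predicted = predict_labels(ensemble_average(outputs))
print(predicted)  # ['toxic', 'obscene']
```

Because each label is thresholded independently, a single comment can receive several labels at once, which is what distinguishes this multi-label setup from ordinary multi-class classification.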
Team Members: Zeba Karishma, Pranav Venkit
Theme: Application of NLP, Language Models, Deep Learning
Through this work, we showcase the importance of using Twitter features to help the model understand the sentiment involved and thus predict the most suitable emoji for a text. To further understand emoji behavioral patterns, we released a balanced dataset built by crawling Twitter data, with timestamp, hashtags, and application source as additional attributes of each tweet. Our data analysis and neural network performance evaluations show that using hashtags and application sources as features allows the model to encode additional information and is effective for emoji prediction.
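One possible way to expose hashtags and application source to a model is sketched below as a sparse feature dictionary. The feature naming scheme and the function `tweet_features` are hypothetical and are not the encoding used in the project, which fed these attributes to neural network models.

```python
import re
from typing import Dict

HASHTAG_RE = re.compile(r"#(\w+)")


def tweet_features(text: str, source: str) -> Dict[str, float]:
    """Build binary features from the tweet's words, hashtags, and application source."""
    features: Dict[str, float] = {}
    # Hashtags become their own features, separate from ordinary words.
    for tag in HASHTAG_RE.findall(text):
        features[f"hashtag={tag.lower()}"] = 1.0
    # The application source (for example, the client app) is one categorical feature.
    features[f"source={source}"] = 1.0
    # Remaining words, with hashtags removed, are plain bag-of-words features.
    for token in HASHTAG_RE.sub("", text).lower().split():
        features[f"word={token}"] = 1.0
    return features


example = tweet_features("Beach day #Sunny", "Twitter for iPhone")
print(sorted(example))
```

Keeping hashtags and source in separate feature namespaces lets a downstream model weight them differently from the tweet text itself.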
Team Members: Pranav Venkit, Zeba Karishma, Chi-Yang Hsu, Rahul Katiki
Theme: Application of NLP, Sentiment Analysis, Language Models, Deep Learning
'Beeism' studies the change in beehive activity at different periods of the year by visually translating it into a format that can be easily understood. Analyzing bee behavioral datasets can be particularly complex and nuanced. To tackle this issue, the visualization communicates the beauty and vitality of the data through effective code and design, using p5.js. It is based on the change in beehive weight over a period of one year. The visualization makes it easier to understand how beehive activity differs across seasons, which shows the impact of climatic conditions on a bee colony.
Team Members: Anjana Padmakumar, Pranav Venkit
Theme: Data Visualization, Design through Data
Beyond Tweet combines AI-based sentiment analysis with various other data environments. Because of the way our data is layered, it supports a wide range of demographic analyses that any user can customize. We indexed an openly available Twitter dataset (5 million tweets) at scale using Elasticsearch. Using US Census data and machine-generated sentiment values, we created a dashboard that showcases the sentiment of topics with respect to various demographic values in the USA.
Team Members: Shaurya Rohatgi, Mukud Srinath, Pranav Venkit
Theme: Application of NLP, Social Media Analysis, Sentiment Analysis