DataFirst Projects Fall 2023    August 31, 2023 at 13:32    Script by Ulf Hermjakob


(1) Learning and forgetting in neural networks
(2) Utilizing AI Generated Images for Object Detection and Classification
(3) A Knowledge Graph of a Crowdsourcing Event (Postponed to Spring 2024)
(4) Urban Futures Data Core
(5) Pyleoclim: A Python Package for the Analysis of Paleoclimate Data
(6) AI/ML assisted fault detection in foundry processed devices
(7) Assessing the California Public Sector Job Market
(8) Does Municipal Broadband Deliver as Promised? An examination of broadband pricing and household adoption in areas served by muni networks.
(9) Automated question type coding of forensic interviews
(10) Building a Platform for NFL Data Insights
(11) Understanding the Relation Between Noise and Bias in Annotated Datasets
(12) Federated Learning for Neuroscience
(13) Bad Writing is "Fine": Tuning an LLM to Suggest Improvements
(14) Analyzing Open Source Software Ecosystems
(15) Build a multilingual decipherment system
(16) Natural language processing of safety reports in nuclear power plants
(17) Application of AI, ML and NLP in understanding and preventing a serious aviation safety problem in the US - Runway Safety
(18) AI Ethics for Smart Health through Smart Watches
(19) Regular Data: Quality health monitoring while you sit  (added on Sept. 8, 2023)



(1) Learning and forgetting in neural networks     Prof. Marcin Abram
In this project, you will examine the mechanisms responsible for forgetting previous tasks in artificial neural networks. You will study how those mechanisms shape the behavior of neural networks learning from heterogeneous data distributions. You will investigate how new information is stored in neural networks by plotting and interpreting neuron activation patterns. You will also compare different learning schemes and examine how they influence the final loss-function landscape.
Skills needed: Python, TensorFlow or PyTorch
Students will learn: How the information is stored in neural networks. How neural networks can forget how to perform previously mastered tasks. How to interpret neural networks (by examining the neuron activation patterns). How to conduct scientific experiments (in the domain of machine learning). How to present and visualize scientific data.
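The forgetting effect the project studies can be demonstrated even with a toy model. Below is a minimal pure-Python sketch (the project itself will use TensorFlow or PyTorch; the tasks and the perceptron here are invented for illustration): a linear classifier masters task A, and subsequent training on a conflicting task B overwrites it.

```python
import random

random.seed(0)  # reproducible toy data

def make_task(rule, n=200):
    """2-D points labeled by a simple rule."""
    pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
    return [((x, y), 1 if rule(x, y) else 0) for x, y in pts]

def train(w, b, task, epochs=20, lr=0.1):
    """Plain perceptron updates; a new task overwrites what the weights encoded before."""
    for _ in range(epochs):
        for (x, y), label in task:
            pred = 1 if w[0] * x + w[1] * y + b > 0 else 0
            err = label - pred
            w[0] += lr * err * x
            w[1] += lr * err * y
            b += lr * err
    return w, b

def accuracy(w, b, task):
    hits = sum((1 if w[0] * x + w[1] * y + b > 0 else 0) == label
               for (x, y), label in task)
    return hits / len(task)

task_a = make_task(lambda x, y: x > 0)   # task A: positive x
task_b = make_task(lambda x, y: x <= 0)  # task B: the reverse labeling

w, b = [0.0, 0.0], 0.0
w, b = train(w, b, task_a)
acc_a_before = accuracy(w, b, task_a)    # high: A has been learned
w, b = train(w, b, task_b)
acc_a_after = accuracy(w, b, task_a)     # collapses: B overwrote A
print(acc_a_before, acc_a_after)
```

In a real network the picture is richer (forgetting is partial and depends on task overlap), which is exactly what plotting activation patterns helps to disentangle.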



(2) Utilizing AI Generated Images for Object Detection and Classification     Prof. Seon Ho Kim
Developing image-based object detection and classification models requires significant time, resources, and effort. In particular, acquiring a good training dataset is essential. However, there are cases where quality data is very hard to obtain, such as rare events (e.g., disasters) or data that is expensive to collect (e.g., faraway places). Thanks to the development of generative AI, we may be able to produce synthetic images that enhance the quality of a dataset by filling in its missing images. Based on our prior work in object detection and classification for smart city applications, we would like to explore the potential of AI-generated images for enhanced object detection and classification.
Skills needed: Python, some knowledge of image machine learning, familiarity with existing methods such as YOLO is desirable, imagination!!!
Students will learn: Image machine learning, object detection



(3) A Knowledge Graph of a Crowdsourcing Event   (Postponed to Spring 2024)     Prof. Daniel O'Leary
Build a knowledge graph of a crowdsourcing event.
Skills needed: Neo4j or Python
Students will learn: Crowdsourcing, knowledge graphs and text analysis



(4) Urban Futures Data Core     Prof. Alice Chen
Cities are the focal point of economic, social, and environmental challenges and opportunities. To establish USC as a thought leader and partner of choice to tackle the challenges of the urban future, the USC Sol Price School of Public Policy and the USC Marshall School of Business propose establishing an Urban Futures Data Core to serve as a university-wide hub for data analysis and dissemination. Students working on this project will work with all faculty at Price and Marshall to catalogue the publicly available, restricted-use, and self-collected datasets that USC researchers have previously used. They will then create a secure website to track each data source and its data use agreements, dates of availability, and geographic level of granularity. After a data website is constructed, students will have the opportunity to assist with creating geographic visualizations of key indices related to urban futures.
Skills needed: Web design, Python
Students will learn: The students will learn about all data sources used in public policy and business, data management, and web design.



(5) Pyleoclim: A Python Package for the Analysis of Paleoclimate Data     Prof. Deborah Khider/Julien Emile-Geay
Paleoclimate timeseries data are crucial to understand how climate has changed in the past. A major aspect of this work falls under exploratory analysis, and in particular, visualization. Pyleoclim contains many functionalities for timeseries analysis of paleoclimate data and has already been used in teaching and research settings. In the coming months, we are expanding several functionalities of the package to address growing community need: outlier detection, automated visualizations, automated checks for the validity of datasets loaded into the package. In addition, these new functionalities will be integrated into tutorials distributed through a Jupyter Book.
Skills needed: Python: pandas, numpy, matplotlib, seaborn. GitHub knowledge preferred but not necessary
Students will learn: Timeseries analysis, Python packaging, continuous integration, containerization, GitHub, Jupyter, Binder.
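As a flavor of the outlier-detection functionality being added, here is a minimal sketch in plain Python (Pyleoclim's own routines are more sophisticated and operate on its Series objects; the data below are invented):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Indices whose value lies more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# A toy "temperature anomaly" series with one spike at index 5.
series = [0.1, 0.2, 0.15, 0.18, 0.12, 9.0, 0.14, 0.16, 0.13, 0.17]
print(zscore_outliers(series, threshold=2.0))  # → [5]
```

Note that a large outlier inflates the standard deviation it is compared against (here the spike survives a threshold of 2 but not 3), one reason robust detectors are worth building into the package.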



(6) AI/ML assisted fault detection in foundry processed devices     Prof. Andrew Rittenbach and JP Walters
Highly accurate fault detection in foundry-produced microelectronics is crucial to ensuring the quality of devices that leave the foundry. However, current defect detection flows are human-centric, which creates a bottleneck. The objective of this project is to leverage recent advances in AI/ML to develop automated techniques that can 1) identify manufacturing defects in microelectronics using imagery collected at the foundry, and 2) determine whether the identified defect will impact the performance of the manufactured component.
Skills needed: Python, PyTorch or TensorFlow, image analysis
Students will learn: Students will learn about manufacturing defect detection algorithms, machine learning techniques, and microelectronics fabrication.



(7) Assessing the California Public Sector Job Market     Prof. William Resh
Public sector institutions at the local, state, and federal levels are facing an unprecedented hiring crisis in the competition for new talent. Yet there is no systematic understanding of the needs and openings across these levels of government to inform stakeholders such as universities, community colleges, and high schools about current and emerging hiring trends in what constitutes approximately 15-20% of the entire labor market. In this project, students will develop algorithms that continuously scrape relevant job sites used by these governments to assess both established and emerging hiring trends by aptitude, profession, entry level, mobility, location, and other important attributes. In so doing, the project will inform researchers in public policy, public administration, political science, and labor economics, as well as practitioners in government and associated stakeholders.
Skills needed: Python, Statistics, R or Stata
Students will learn: Students will learn how to develop and organize labor market data to be used by practitioners and researchers through the construction of a portal that can transform data into usable aggregated statistics and graphs.
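To illustrate one step of such a scraping pipeline, here is a sketch that extracts job titles from a fetched page using only the Python standard library (the HTML structure and class name are hypothetical; a real scraper must also respect each site's terms of use and robots.txt):

```python
from html.parser import HTMLParser

class JobTitleParser(HTMLParser):
    """Collect text inside <a class="job-title"> elements (class name is hypothetical)."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "job-title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

# A stand-in for a page fetched from a government job board.
page = ('<ul><li><a class="job-title">Staff Services Analyst</a></li>'
        '<li><a class="job-title">Environmental Scientist</a></li></ul>')
parser = JobTitleParser()
parser.feed(page)
print(parser.titles)
```

Downstream, records like these would be aggregated by profession, location, and entry level to produce the trend statistics described above.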



(8) Does Municipal Broadband Deliver as Promised? An examination of broadband pricing and household adoption in areas served by muni networks.     Prof. Hernan Galperin
Broadband networks owned and/or operated by local governments ("muni networks") are increasingly seen as a key tool to close the digital divide in Internet availability and adoption. There is, however, only anecdotal evidence about whether muni networks deliver on the promise of more affordable broadband in communities of little interest to traditional ISPs - typically disadvantaged communities. Taking advantage of the greater level of resolution in the new FCC broadband availability maps, this project will examine broadband pricing and adoption at the address level in areas served by muni networks, using a matched sample of comparable areas as a reference point. The goal of the project is to empirically assess whether muni networks are delivering on the promise of more affordable services, and whether this results in more household adoption than expected. The project is a component of an ongoing collaboration with digital equity advocacy organizations.
Skills needed: Data scraping (Python), statistics and basic GIS skills
Students will learn: Students will have the opportunity to apply data scraping, organization and analysis skills in the context of policy analysis



(9) Automated question type coding of forensic interviews     Prof. Thomas D. Lyon
Question type coding is used in research on forensic interviewing to distinguish between best-practice open-ended questions, and closed-ended and leading questions that interviewers are trained to avoid. Most research teams in the field rely on a time-consuming and labor-intensive method of question type coding whereby a researcher codes every question in the interview, and a second researcher codes a subset to demonstrate inter-rater reliability. We are currently working with a graduate of the Master's in Computer Science program at USC on a project exploring automated question type coding of forensic interviews with victims of child abuse. In collaboration with the student, we have trained a large language model (RoBERTa) to distinguish between question types based on a rudimentary classification system. In the next stage of the project, we are aiming to fine-tune the model and use zero-shot and few-shot prompting to make distinctions for which there is limited manually coded data.
Skills needed: Python and/or R
Students will learn: Students will learn to train and finetune large language models
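A crude rule-based baseline conveys what the coding task involves. The sketch below uses a simplified, hypothetical label set, not the lab's actual coding scheme; the finer distinctions are exactly what the fine-tuned RoBERTa model is meant to capture:

```python
import re

# Toy patterns for a simplified question-type taxonomy.
OPEN_INVITATIONS = re.compile(r"^(tell me|what happened|describe)", re.I)
WH_WORDS = re.compile(r"^(what|who|where|when|how|why)\b", re.I)
YES_NO_STARTS = re.compile(r"^(did|do|does|is|are|was|were|can|could|would|will|have|has)\b", re.I)

def code_question(q):
    """Very rough baseline coder; ordering matters (invitations before generic wh-questions)."""
    q = q.strip()
    if OPEN_INVITATIONS.match(q):
        return "invitation"
    if WH_WORDS.match(q):
        return "wh-question"
    if YES_NO_STARTS.match(q):
        return "yes-no"
    return "other"

print(code_question("Tell me everything that happened."))  # invitation
print(code_question("Did he touch you?"))                  # yes-no
```

Rules like these break down quickly (e.g., tag questions, embedded suggestions), which is why the project moves to a trained model and prompting for low-resource distinctions.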



(10) Building a Platform for NFL Data Insights     Prof. Jeremy Abramson
Open source sports data such as the nflverse has led to a massive increase in public sports analytics. But it is still hard to process, subset, visualize, and analyze this data. This project will build a general-purpose analysis platform and dashboard, similar to what many teams use internally. Using the nflfastr data, this platform will allow interested individuals to select the play parameters they're interested in, and will provide relevant analysis, visualization and insight. Ideally, we'll set up the dashboard on the internet, and open source the project, allowing others to expand the available datasets, analyses and visualizations.
Skills needed: Python (pandas, streamlit, dash, etc. is a bonus!)
Students will learn: How to analyze and present insights from NFL play-by-play data
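The core subset-and-summarize operation such a dashboard performs can be sketched in a few lines (toy rows; the column names are simplified stand-ins for the real play-by-play schema):

```python
from statistics import mean

# Toy play-by-play rows; field names are illustrative.
plays = [
    {"posteam": "LAR", "play_type": "pass", "down": 1, "epa": 0.45},
    {"posteam": "LAR", "play_type": "run",  "down": 1, "epa": -0.12},
    {"posteam": "LAR", "play_type": "pass", "down": 3, "epa": 0.80},
    {"posteam": "SEA", "play_type": "pass", "down": 1, "epa": 0.10},
]

def subset(rows, **filters):
    """Select plays matching every given column=value pair."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]

lar_passes = subset(plays, posteam="LAR", play_type="pass")
print(round(mean(p["epa"] for p in lar_passes), 3))  # mean EPA on LAR pass plays
```

In the actual platform, pandas would handle the filtering and a dashboard layer (e.g., streamlit or dash) would expose the parameter selection interactively.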



(11) Understanding the Relation Between Noise and Bias in Annotated Datasets     Negar Mokhberian
When it comes to classification tasks, much previous work has tried to design larger and more complex neural networks. Recently, the data-centric AI line of research has worked on shifting the focus to the quality of the training data. This shift arises from the recognition that the annotations associated with dataset instances can exhibit both noise, stemming from vague instructions or human errors, and bias, arising from differing perspectives among annotators in response to given prompts. In this project, our objective is to bridge the gap between two lines of research: one dedicated to identifying noisy instances and the other striving to account for the diverse perspectives of annotators. Specifically, we will delve into the domain of offensive text detection datasets, a highly subjective task. Our investigation will center on whether perspectivist classification models have effectively harnessed valuable information from instances flagged as noisy by noise-detection techniques.
Skills needed: Python, PyTorch, Fine-tuning language models in Huggingface package
Students will learn: The student will learn the importance of individual instances and individual annotations in training the classification models. Each of these datapoints can introduce either useful signal or noise to the model and the student will learn to recognize the difference.
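One simple signal for separating consensus items from contested ones is per-item annotation entropy; a minimal sketch (labels invented for illustration):

```python
from collections import Counter
from math import log2

def annotation_entropy(labels):
    """Shannon entropy of one item's label distribution (0 = full annotator agreement)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

consensus = ["offensive"] * 5                                 # all annotators agree
contested = ["offensive", "not", "offensive", "not", "not"]   # split vote

print(annotation_entropy(consensus))
print(annotation_entropy(contested))
```

High-entropy items may reflect legitimate perspective differences rather than noise, which is precisely the distinction the project probes: noise-detection techniques tend to flag both, while perspectivist models try to keep the former.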



(12) Federated Learning for Neuroscience     Prof. Jose-Luis Ambite
Federated learning is an approach to distributed deep learning without sharing data. Multiple sites train a neural network over their private data. The parameters of the neural network are shared with a federation controller, but they are encrypted before sharing. Model aggregation is performed under fully homomorphic encryption. We propose to apply federated learning to several problems in neuroscience, such as predicting Alzheimer's, Parkinson's, epilepsy, and autism, possibly over multimodal data.
Skills needed: Python, Tensorflow, deep learning
Students will learn: Federated learning, machine learning for biomedical applications.
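The aggregation step at the heart of federated averaging (FedAvg) is a weighted mean of client parameters; the toy sketch below omits the homomorphic encryption layer the project uses, and the numbers are invented:

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hospitals with different amounts of local data; parameters after local training.
site_a = [1.0, 2.0]   # 100 samples
site_b = [5.0, 0.0]   # 300 samples

global_model = fedavg([site_a, site_b], [100, 300])
print(global_model)   # [4.0, 0.5]
```

In the encrypted setting, the controller computes this same weighted sum over ciphertexts, so it never sees any site's parameters in the clear.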



(13) Bad Writing is "Fine": Tuning an LLM to Suggest Improvements     Prof. Benjamin Nye
Prototype an approach to fine-tune a large language model (LLM) to help diagnose areas to improve in a specific writing product. For example, scientific papers require consistent language, but in creative writing variety matters. Proposed steps are:
1. Writing Product: Coordinate with project mentors to choose a common and important writing product, such as a position paper or an academic conference paper.
2. Rubric and Corpus: Identify/gather a rubric and a corpus.
3. Inject Bad Writing: For each element of the rubric, develop prompts for generative AI to decrease the quality of the writing based on the rubric (i.e., make it worse). This will form a training data set of good examples paired with versions made worse on certain characteristics.
4. Fine-Tune: Students will be expected to attempt to fine-tune an LLM (e.g., LLAMA 2) on this synthetically generated data.
5. Evaluate: Research whether tuning suggests better domain-specific areas to improve.
This project aligns with ongoing work with the USC Generative AI Center.
Skills needed: Python
Students will learn: Generative AI for large language models. Generating synthetic data for a rubric. Fine tuning a large language model, likely using CARC (the on campus computing cluster). Understanding intelligent tutoring system design fundamentals for modeling how experts diagnose issues from novices.
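The bad-writing injection step can be sketched as a small data-generation loop; the rubric elements, prompt templates, and degrader below are placeholders (a real run would send each prompt to a generative model and store its rewrite):

```python
# Hypothetical rubric elements mapped to degradation instructions.
RUBRIC = {
    "clarity": "Rewrite the passage to be vaguer and more convoluted.",
    "citation_use": "Rewrite the passage removing support for its claims.",
}

def build_training_pairs(good_passages, degrade):
    """One (good, degraded, rubric_element) record per passage per rubric element."""
    pairs = []
    for passage in good_passages:
        for element, instruction in RUBRIC.items():
            pairs.append({
                "good": passage,
                "bad": degrade(passage, instruction),  # stand-in for an LLM call
                "rubric_element": element,
            })
    return pairs

# Toy degrader so the sketch runs without a model: it tags the text instead of rewriting it.
toy_degrade = lambda text, instruction: f"[DEGRADED per: {instruction}] {text}"

pairs = build_training_pairs(["The results support the hypothesis."], toy_degrade)
print(len(pairs))  # 2
```

Each record then becomes a supervised example for fine-tuning: given the degraded text, predict which rubric element was violated (and ideally suggest the fix).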



(14) Analyzing Open Source Software Ecosystems     Prof. Jeremy Abramson
Open source runs a lot of the world's critical software systems, but there is much that's unknown in how maintainers, developers and other parts of the software ecosystem function. Help us analyze a large corpus of open source data — both source code and patch conversations — to better understand them! We'll study things like rise to influence, authorship styles, malware analysis, topic modeling and social network analysis!
Skills needed: Python (needed), experience with LLMs/OpenAI APIs, program analysis, C code (preferred, but not necessary!)
Students will learn: We'll touch on using LLMs to parse text messages and analyze code, graph databases, program analysis, and social network analysis among other skills



(15) Build a multilingual decipherment system     Prof. Jonathan May
We will build a working system that can decipher a letter substitution cipher in 14 languages and beyond, based on https://aclanthology.org/2021.acl-long.561/, then apply it to languages it has never seen.
Skills needed: python, deep learning with transformers
Students will learn: read and understand an NLP paper, unusual applications of transformers, reproduction study
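The cipher itself is simple: a key is just a permutation of the alphabet, and deciphering means recovering its inverse. A plain-Python sketch of key generation and application (the project's contribution is the learned, transformer-based decipherment model from the linked paper, not this mechanical step):

```python
import random
import string

def random_key(seed=0):
    """A substitution-cipher key: a random permutation of the lowercase alphabet."""
    letters = list(string.ascii_lowercase)
    rng = random.Random(seed)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def apply_key(text, key):
    """Substitute each letter via the key; leave spaces and other characters alone."""
    return "".join(key.get(c, c) for c in text)

key = random_key()
inverse = {v: k for k, v in key.items()}

plain = "attack at dawn"
cipher = apply_key(plain, key)
assert apply_key(cipher, inverse) == plain  # deciphering = inverting the permutation
print(cipher)
```

The hard part, which the system learns, is recovering `inverse` from ciphertext alone, using the letter statistics of the target language.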



(16) Natural language processing of safety reports in nuclear power plants     Prof. Najmedin Meshkati
This project, which will be co-advised by Dr. Yolanda Gil, will use Natural Language Processing (NLP) techniques to analyze voluminous Diablo Canyon Independent Safety Committee (DCISC) annual reports to identify the role and contribution of "Traits of a Healthy Nuclear Safety Culture", as defined by the Nuclear Regulatory Commission and the Institute of Nuclear Power Operations, in incident causation.
Skills needed: Natural Language Processing and related skills.
Students will learn: Application of NLP in real-world, working on very serious and important issues with global applications, which can be generalized and applied to other safety-sensitive technologies.



(17) Application of AI, ML and NLP in understanding and preventing a serious aviation safety problem in the US - Runway Safety     Prof. Najmedin Meshkati
This project, which is co-advised by Dr. Yolanda Gil, will use AI/ML/NLP to understand the root causes of one of the most serious aviation safety problems in the US - runway incursions. The Aviation Safety Reporting System, which is administered by NASA and is an untapped treasure trove of textual data, will be used for this project.
Skills needed: AI/ML/NLP and associated skills.
Students will learn: Using AI/ML/NLP and working on the data from a major global industry - aviation.



(18) AI Ethics for Smart Health through Smart Watches     Prof. Yolanda Gil
Lots of personal data can be obtained from wearable devices such as smart watches. This data can be used to improve health, for example to learn to detect health problems and to check whether people adhere to doctor’s exercise recommendations. This project will conduct a thorough study of the ethical issues in using AI systems in this domain, with recommendations of how AI systems for smart health should be designed with ethical considerations in mind.
Skills needed: Interest in healthcare and data science.
Students will learn: What kinds of health-related data can be captured through wearable devices, what kinds of analyses are possible, privacy and ethical aspects of personal applications for smart health.



(19) Regular Data: Quality health monitoring while you sit  (added on Sept. 8, 2023)     Prof. Francisco Valero-Cuevas, Prof. Yogi Matharu, Prof. Yolanda Gil
Goals: To create a software-as-a-service data pipeline for collecting health biomarkers via an instrumented toilet seat. This clinical data management system (CDMS) will enable clinical-grade, high-quality, curated, and consistent data capture that meets NIH and FDA standards of clinical utility.
Needs: Engineering and software architecture skills to create the data pipeline from instrument signals to health reports compatible with reimbursement, research and clinical data systems.