Hello,
I am Era Parihar

A Data Scientist with a focus on ML, NLP, Statistics, and LLMs. Check out my resume here.

About Me

Hello! I'm Era Parihar, a Data Scientist passionate about solving real-world problems through analytical reasoning and intelligent systems. I recently completed my Master’s in Data Science at the University of Michigan, Ann Arbor, where I was recognized with a merit-based scholarship from the state of Rajasthan, India, and had the honor of being chosen to speak on behalf of my graduating class at our commencement ceremony in May 2025.

My journey began with a strong foundation in Computer Science from Birla Institute of Technology and Science, Pilani, and has since evolved across a wide range of domains. During my internship at the United Nations, I fine-tuned AI models to enhance climate-document classification and disaster response. I also built an Azure-based Retrieval-Augmented Generation (RAG) pipeline that automatically produces concise, citation-rich summaries—making critical insights more accessible to policy teams.

Previously, I applied machine learning to build predictive models and recommendation systems in the fintech and environmental sectors, driving user engagement and supporting ESG-focused decision-making.

My research at Carnegie Mellon University explored how large language models learn in early training stages, using contrastive learning to fuse speech and text signals—advancing capabilities in multimodal AI.

Equipped with hands-on experience in tools like Python, SQL, Tableau, and a wide suite of ML libraries, I thrive at the intersection of NLP, deep learning, and data-driven strategy. I’m always curious about how data can inform decisions, uncover patterns, and drive meaningful impact.

EDUCATION

University of Michigan

Master of Science in Data Science

August 2023 – May 2025

GPA – 3.4 / 4.0

Relevant Coursework

Statistical Modeling, Machine Learning, Deep Learning, NLP, Data Mining, Applied Regression, Time Series Forecasting, Causal Inference, Data Visualization, Data Engineering

Projects

MedQuery: Evidence-Based Clinical Decision Support

Developed a complete RAG pipeline: retrieved articles from PubMed using NCBI E-utilities, embedded abstracts with sentence transformers, indexed them using FAISS, and generated chain-of-thought explanations with open-source language models.

Enabled explainable, verifiable answers by linking every claim to source articles — bridging the gap between LLMs and evidence-based medicine.

Tools: Python, Streamlit, Hugging Face Transformers, SentenceTransformers, FAISS, PubMed API, Torch

*Currently working on deployment for easier accessibility.
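As a rough illustration of the retrieval step in a pipeline like this (not the project's actual code), the sketch below ranks documents by cosine similarity to a query vector and returns their ids, so that each generated claim can be traced back to a source. It uses plain Python with tiny made-up vectors standing in for sentence-transformer embeddings, and a brute-force scan standing in for a FAISS index. The names `cosine`, `retrieve`, and `toy_index` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec.

    `index` maps a document id to its embedding vector. This is an
    exhaustive scan; a vector index does the same ranking faster.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Assumption: 3-dimensional toy vectors; real sentence embeddings
# typically have hundreds of dimensions.
toy_index = {
    "abstract_a": [1.0, 0.0, 0.0],
    "abstract_b": [0.0, 1.0, 0.0],
    "abstract_c": [0.7, 0.7, 0.0],
}

hits = retrieve([1.0, 0.1, 0.0], toy_index, k=2)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

Keeping the document id next to each score is what makes the answer verifiable: the generator can cite the ids of the passages it was given.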

Register-Augmented LLM Finetuning

Collaborated with a team to develop a novel “register-augmentation” technique for transformer-based language models (BERT), improving question-answering performance by adding specialized “register” tokens during fine-tuning.

Implemented interpretability methods (Integrated Gradients, Layer-wise Relevance Propagation) to visualize how register tokens enhance the model's focus on task-relevant context, resulting in better F1 and Exact Match scores compared to standard fine-tuning.
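For readers unfamiliar with the two metrics named above, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive question answering. It is a simplified version under stated assumptions: normalization is only lowercasing and whitespace splitting (standard evaluation scripts also strip punctuation and articles), and the function names are hypothetical rather than taken from the project.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1 if the normalized answers are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Ann Arbor", "ann arbor"))
print(round(token_f1("the city of Ann Arbor", "Ann Arbor"), 3))
```

Exact Match rewards only a perfect span, while F1 gives partial credit when the predicted span overlaps the reference, which is why the two are usually reported together.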

Answer-Aware Question Generation System

Fine-tuned T5 and BART on QA datasets to generate high-quality questions, using BLEU, ROUGE, METEOR, and BERTScore to compare models and optimize performance.

Ann Arbor Water Production Forecasting

Using Voting Regressor for Time-Series Prediction

Built a predictive model to forecast daily water production in Ann Arbor based on 8 years of weather and usage data. Engineered temporal features and environmental indicators to capture seasonal demand fluctuations driven by rainfall and temperature.

Implemented and benchmarked multiple regression models, including Random Forest, Gradient Boosting, and Ridge, and combined them using a Voting Regressor for improved accuracy and stability.

Enabled data-driven water resource planning by predicting future demand and reconstructing sensor gaps, supporting the city’s sustainability goals.

Tools: Python, Scikit-learn, Voting Regressor, Pandas, Matplotlib, NumPy
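The idea behind a voting regressor can be sketched in a few lines: each base model predicts independently and the ensemble returns the (optionally weighted) average. The example below is a plain-Python stand-in for that averaging step, not the project's scikit-learn code. The three `model_*` functions, their coefficients, and the feature names are made up purely for illustration.

```python
def voting_predict(models, features, weights=None):
    """Average the predictions of several models for one feature row."""
    predictions = [model(features) for model in models]
    if weights is None:
        weights = [1.0] * len(models)
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical stand-ins for fitted regressors; the coefficients are
# invented and do not come from any real data.
def model_a(row):
    return 10.0 + 0.5 * row["temperature"]


def model_b(row):
    return 13.0 + 0.25 * row["temperature"] - 0.5 * row["rainfall"]


def model_c(row):
    return 8.0 + 0.75 * row["temperature"]


row = {"temperature": 20.0, "rainfall": 5.0}
models = [model_a, model_b, model_c]

equal_vote = voting_predict(models, row)
weighted_vote = voting_predict(models, row, weights=[2, 1, 1])
print(equal_vote, weighted_vote)
```

Averaging tends to cancel out the individual models' errors when those errors are not strongly correlated, which is the usual argument for the stability gain.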

ThermDepth

Forecasting Subsurface Lake Temperatures for Environmental Intelligence

Predicted hourly temperatures at 10.5m depth in Trout Lake (2018–2019) using multivariate time-series data from 2012–2018 across varying depths.

Engineered temporal and spatial features; benchmarked models including Random Forest, XGBoost, and LSTM for best performance.

Supported climate-aware ecological monitoring by reconstructing missing sensor data with high accuracy.

Tools: Python, Scikit-learn, XGBoost, LSTM, Pandas, Matplotlib
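One common way to prepare a sensor series for models such as Random Forest, XGBoost, or an LSTM is to turn it into sliding windows of lagged values. The sketch below shows that transformation in plain Python. It is an assumption about the general approach, not the project's actual feature pipeline; `make_windows` is a hypothetical helper and the readings are made up.

```python
def make_windows(series, n_lags):
    """Build (inputs, target) pairs from the previous n_lags readings.

    Each target series[t] is paired with the n_lags values before it.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


# Made-up hourly temperature readings, for illustration only.
temps = [4.1, 4.3, 4.2, 4.6, 4.8]

pairs = make_windows(temps, n_lags=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```

The same windows can be filled in across a gap one step at a time, feeding each prediction back in as the newest lag, which is one way missing sensor readings are reconstructed.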

PotholeSense

Predicting Pothole Severity in Chicago Using LightGBM for Smart Infrastructure Maintenance

Built a supervised machine learning pipeline using LightGBM to classify pothole severity across Chicago, utilizing 311 service request data, traffic volume, and road condition features. The model achieved 92% precision and 88% recall in identifying high-risk potholes, enabling data-driven prioritization of repair work.

Engineered predictive features including response lag, weather seasonality, and zip-code-based clustering. Applied robust preprocessing (missing-value imputation, label encoding, and outlier handling) to prepare the data for training.

Benchmarked LightGBM against Decision Trees, Random Forests, and XGBoost using stratified cross-validation and grid search. LightGBM consistently delivered the best performance with low inference time.

Tool Stack: Python, LightGBM, Scikit-learn, XGBoost, Decision Trees, Random Forests, Feature Engineering, Hyperparameter Tuning, Cross-validation, Pandas, Geopandas, Matplotlib, Seaborn
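For context on the precision and recall figures quoted above, the sketch below shows how the two metrics are computed from true and predicted labels for a single "high-risk" class. The labels are a small invented example and do not reproduce the project's data or its reported scores; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels: 1 = high-risk pothole, 0 = not high-risk.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision answers "of the potholes flagged as high-risk, how many really were?", while recall answers "of the truly high-risk potholes, how many were flagged?". For repair prioritization, low recall means dangerous potholes go unflagged.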

Articles

Latest posts on Machine Learning, NLP, and Generative AI

Skills

Technical strengths across ML, Programming, Data Engineering & Analytics

Machine Learning & Statistics

LLMs, NLP, LangChain, Deep Learning, Predictive Models, Decision Trees, Clustering, Regression, Experimentation, Hypothesis Testing, A/B Testing, Time Series Forecasting, Descriptive & Inferential Stats, Quantitative & Data Analysis.

Programming Languages

Python, SQL, NoSQL, R, SAS, MATLAB, Java

Libraries

PyTorch, Matplotlib, PySpark, PyCharm, TensorFlow, Pandas, Seaborn, NumPy, Scikit-learn, NLTK, spaCy, Streamlit

Data Engineering & Platforms

Docker, Kubernetes, Jupyter, Git, MySQL, Tableau, Google Looker, AWS, Apache Airflow, Snowflake