Hello,
I am Era Parihar
A Data Scientist with a focus on ML, NLP, Statistics, and LLMs. Check out my resume here.

About Me

Hello! I'm Era Parihar, a Data Scientist passionate about solving real-world problems through analytical reasoning and intelligent systems. I recently completed my Master’s in Data Science at the University of Michigan, Ann Arbor, where I was recognized with a merit-based scholarship from the state of Rajasthan, India, and had the honor of being chosen to speak on behalf of my graduating class at our commencement ceremony in May 2025.
My journey began with a strong foundation in Computer Science from Birla Institute of Technology and Science, Pilani, and has since evolved across a wide range of domains. During my internship at the United Nations, I fine-tuned AI models to enhance climate-document classification and disaster response. I also built an Azure-based Retrieval-Augmented Generation (RAG) pipeline that automatically produces concise, citation-rich summaries—making critical insights more accessible to policy teams.
Previously, I applied machine learning to build predictive models and recommendation systems in the fintech and environmental sectors, driving user engagement and supporting ESG-focused decision-making.
My research at Carnegie Mellon University explored how large language models learn in early training stages, using contrastive learning to fuse speech and text signals—advancing capabilities in multimodal AI.
Equipped with hands-on experience in tools like Python, SQL, Tableau, and a wide suite of ML libraries, I thrive at the intersection of NLP, deep learning, and data-driven strategy. I’m always curious about how data can inform decisions, uncover patterns, and drive meaningful impact.
EDUCATION

Master of Science in Data Science
August 2023 – May 2025
GPA – 3.4 / 4.0
Relevant Coursework
Statistical Modeling, Machine Learning, Deep Learning, NLP, Data Mining, Applied Regression, Time Series Forecasting, Causal Inference, Data Visualization, Data Engineering
Projects

MedQuery: Evidence-Based Clinical Decision Support
Developed a complete RAG pipeline: retrieved articles from PubMed using NCBI E-utilities, embedded abstracts with sentence transformers, indexed them using FAISS, and generated chain-of-thought explanations with open-source language models.
Enabled explainable, verifiable answers by linking every claim to source articles — bridging the gap between LLMs and evidence-based medicine.
Tools: Python, Streamlit, Hugging Face Transformers, SentenceTransformers, FAISS, PubMed API, Torch
*Currently working on deployment for easy accessibility.
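The retrieval step described above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the toy three-dimensional vectors stand in for sentence-transformer embeddings, a brute-force cosine search stands in for the FAISS index, and the names `cosine`, `top_k`, and `toy_index` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy index: document id -> embedding (made-up values for illustration only).
toy_index = {
    "article_a": [1.0, 0.0, 0.0],
    "article_b": [0.0, 1.0, 0.0],
    "article_c": [0.7, 0.7, 0.0],
}

retrieved = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print(retrieved)
```

Returning document ids rather than raw text is what lets each generated claim be linked back to its source article.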

Register-Augmented LLM Finetuning
Collaborated with a team to develop a novel “register-augmentation” technique for transformer-based language models (BERT), improving question-answering performance by adding specialized “register” tokens during fine-tuning.
Implemented interpretability methods (Integrated Gradients, Layer-wise Relevance Propagation) to visualize how register tokens enhance the model's focus on task-relevant context, resulting in better F1 and Exact Match scores compared to standard fine-tuning.
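For readers unfamiliar with the two question-answering metrics mentioned above, here is a simplified sketch of how Exact Match and token-level F1 are commonly computed. It assumes whitespace tokenization and lowercasing only; standard evaluation scripts usually also strip punctuation and articles, and the function names are hypothetical.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a simplification)."""
    return text.lower().split()

def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Ann Arbor", "ann arbor"))
print(token_f1("the city of Ann Arbor", "Ann Arbor"))
```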

Answer-Aware Question Generation System
Fine-tuned T5 and BART on QA datasets to generate high-quality questions, using BLEU, ROUGE, METEOR, and BERTScore to compare models and optimize performance.

Ann Arbor Water Production Forecasting
Using Voting Regressor for Time-Series Prediction
Built a predictive model to forecast daily water production in Ann Arbor based on 8 years of weather and usage data. Engineered temporal features and environmental indicators to capture seasonal demand fluctuations driven by rainfall and temperature.
Implemented and benchmarked multiple regression models, including Random Forest, Gradient Boosting, and Ridge, and combined them using a Voting Regressor for improved accuracy and stability.
Enabled data-driven water resource planning by predicting future demand and reconstructing sensor gaps, supporting the city’s sustainability goals.
Tools: Python, Scikit-learn, Voting Regressor, Pandas, Matplotlib, NumPy
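The idea behind a voting regressor is that the final prediction is the (optionally weighted) average of the base models' predictions. The sketch below shows only that averaging step in plain Python; the three lambda "models" are made-up stand-ins for the fitted Random Forest, Gradient Boosting, and Ridge estimators, and `voting_predict` is a hypothetical helper, not the Scikit-learn API.

```python
def voting_predict(models, x, weights=None):
    """Average the predictions of several regressors for one input."""
    preds = [model(x) for model in models]
    if weights is None:
        weights = [1.0] * len(preds)  # equal votes by default
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Stand-in base models (illustrative only, not trained on real data).
models = [
    lambda x: 2.0 * x,
    lambda x: 2.0 * x + 3.0,
    lambda x: 2.0 * x - 1.5,
]

equal_vote = voting_predict(models, 10.0)
weighted_vote = voting_predict(models, 10.0, weights=[2.0, 1.0, 1.0])
print(equal_vote, weighted_vote)
```

Averaging tends to cancel out errors that the individual models do not share, which is one reason ensembles of this kind are often more stable than any single member.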

ThermDepth
Forecasting Subsurface Lake Temperatures for Environmental Intelligence
Predicted hourly temperatures at 10.5m depth in Trout Lake (2018–2019) using multivariate time-series data from 2012–2018 across varying depths.
Engineered temporal and spatial features; benchmarked models including Random Forest, XGBoost, and LSTM for best performance.
Supported climate-aware ecological monitoring by reconstructing missing sensor data with high accuracy.
Tools: Python, Scikit-learn, XGBoost, LSTM, Pandas, Matplotlib
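One common way to prepare a time series for models such as Random Forest, XGBoost, or an LSTM is to turn it into lagged windows, where the previous few readings predict the next one. The sketch below assumes that framing; it is an illustration with made-up temperature values, not the feature set actually used in the project, and `make_windows` is a hypothetical helper.

```python
def make_windows(series, n_lags):
    """Build (features, target) pairs from a univariate series.

    Each feature row holds the n_lags values that precede the target.
    """
    features, targets = [], []
    for i in range(n_lags, len(series)):
        features.append(series[i - n_lags:i])
        targets.append(series[i])
    return features, targets

# Made-up hourly temperatures in degrees Celsius.
temps = [4.0, 4.1, 4.3, 4.2, 4.5]
X, y = make_windows(temps, n_lags=2)
print(X)
print(y)
```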

PotholeSense
Predicting Pothole Severity in Chicago Using LightGBM for Smart Infrastructure Maintenance
Built a supervised machine learning pipeline using LightGBM to classify pothole severity across Chicago, utilizing 311 service request data, traffic volume, and road condition features. The model achieved 92% precision and 88% recall in identifying high-risk potholes, enabling data-driven prioritization of repair work.
Engineered predictive features including response lag, weather seasonality, and zip-code-based clustering. Applied robust preprocessing (missing-value imputation, label encoding, and outlier handling) to prepare the data for training.
Benchmarked LightGBM against Decision Trees, Random Forests, and XGBoost using stratified cross-validation and grid search. LightGBM consistently delivered the best performance with low inference time.
Tools: Python, LightGBM, Scikit-learn, XGBoost, Decision Trees, Random Forests, Feature Engineering, Hyperparameter Tuning, Cross-validation, Pandas, GeoPandas, Matplotlib, Seaborn
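As a reminder of what the precision and recall figures above measure, here is a small sketch that computes both from true and predicted labels. The labels are toy values chosen for illustration, not project data, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels: 1 = high-risk pothole, 0 = not high-risk.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision answers "of the potholes flagged as high-risk, how many really were?", while recall answers "of the truly high-risk potholes, how many were flagged?".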
Articles
Latest posts on Machine Learning, NLP, and Generative AI
Skills
Technical strengths across ML, Programming, Data Engineering & Analytics
Machine Learning & Statistics
LLMs, NLP, LangChain, Deep Learning, Predictive Models, Decision Trees, Clustering, Regression, Experimentation, Hypothesis Testing, A/B Testing, Time Series Forecasting, Descriptive & Inferential Stats, Quantitative & Data Analysis.
Programming Languages
Python, SQL, NoSQL, R, SAS, MATLAB, Java
Libraries
PyTorch, Matplotlib, PySpark, PyCharm, TensorFlow, Pandas, Seaborn, NumPy, Scikit-learn, NLTK, spaCy, Streamlit
Data Engineering & Platforms
Docker, Kubernetes, Jupyter, Git, MySQL, Tableau, Google Looker, AWS, Apache Airflow, Snowflake