Hello,
I am Era Parihar

A Data Scientist with a focus on ML, NLP, Statistics, and LLMs. Check out my resume here.

About Me

Hello! I'm Era Parihar, a Data Scientist passionate about solving real-world problems through analytical reasoning and intelligent systems. I recently completed my Master’s in Data Science at the University of Michigan, Ann Arbor, where I was recognized with a merit-based scholarship from the state of Rajasthan, India, and had the honor of being chosen to speak on behalf of my graduating class at our commencement ceremony in May 2025.

My journey began with a strong foundation in Computer Science from Birla Institute of Technology and Science, Pilani, and has since evolved across a wide range of domains. During my internship at the United Nations, I fine-tuned AI models to enhance climate-document classification and disaster response. I also built an Azure-based Retrieval-Augmented Generation (RAG) pipeline that automatically produces concise, citation-rich summaries—making critical insights more accessible to policy teams.

Previously, I applied machine learning to build predictive models and recommendation systems in the fintech and environmental sectors, driving user engagement and supporting ESG-focused decision-making.

My research at Carnegie Mellon University explored how large language models learn in early training stages, using contrastive learning to fuse speech and text signals—advancing capabilities in multimodal AI.

Equipped with hands-on experience in tools like Python, SQL, Tableau, and a wide suite of ML libraries, I thrive at the intersection of NLP, deep learning, and data-driven strategy. I’m always curious about how data can inform decisions, uncover patterns, and drive meaningful impact.

EDUCATION

Master of Science in Data Science

August 2023 – May 2025

GPA – 3.4 / 4.0

Relevant Coursework

Statistical Modeling, Machine Learning, Deep Learning, NLP, Data Mining, Applied Regression, Time Series Forecasting, Causal Inference, Data Visualization, Data Engineering

Work Experience

Data & AI Intern

United Nations (September 2024 - December 2024)

Engineered and deployed an end-to-end Retrieval-Augmented Generation pipeline in Azure that produces interactive country reports with citations and hyperlinks.
Fine-tuned ClimateBERT on humanitarian-aid datasets, boosting climate-document classification accuracy to 85% and accelerating disaster-response insight extraction.
Partnered with policy and engineering teams to align advanced analytics deliverables with operational goals.

Data Scientist

Deriv Limited (April 2021 - May 2022)

Developed an advanced Top-K recommendation system for Affiliate Managers, increasing user engagement by 33%.
Designed and analyzed A/B tests to evaluate UI and email campaign changes, optimizing user click-through and conversion rates.
Built Apache Spark on Hadoop pipelines for large-scale transactional data; automated affiliate payments using Airflow & PostgreSQL, reducing processing time by 25% and enabling real-time customer lifetime value (CLV) analytics.
Built and deployed Random Forest models for predictive analytics in affiliate performance, integrating BI dashboards (Tableau, Power BI) to drive cross-functional strategic decision-making and real-time insights.
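The A/B tests mentioned above compare click-through or conversion rates between two variants. A minimal sketch of one common way to analyze such a test, a pooled two-proportion z-test, is shown below. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the original experiments, and the actual analysis may have used a different method.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one click-through rate."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that A and B perform identically.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts only (hypothetical, not real campaign data).
z, p = two_proportion_z_test(clicks_a=200, n_a=2000, clicks_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be due to chance alone; the significance threshold is a design choice made before the test runs.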

Data Scientist

SCS Enviro Services (July 2020 - March 2021)

Optimized meteorological data extraction and preprocessing workflows using Python, SQL, and business intelligence tools (Tableau, Power BI), improving data analytics efficiency by 15%.
Built ETL data pipelines using SQL and Python to consolidate ESG data, developing end-to-end models for quantitative environmental analysis.
Applied ARIMA for air quality forecasting, aiding strategic ESG decision-making.

Machine Learning Intern

Sentient Labs (April 2020 - July 2020)

Built an obstacle detection system using object detection (YOLO) and semantic segmentation techniques (e.g., bounding box detection and pixel classification), tailored for aquatic robot navigation in dynamic waterway conditions.

Projects

MedQuery: Evidence-Based Clinical Decision Support

Developed a complete RAG pipeline: retrieved articles from PubMed using NCBI E-utilities, embedded abstracts with sentence transformers, indexed them using FAISS, and generated chain-of-thought explanations with open-source language models.

Enabled explainable, verifiable answers by linking every claim to source articles — bridging the gap between LLMs and evidence-based medicine.
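The retrieval step of a pipeline like the one described above can be sketched in a few lines. This is a simplified stand-in, not the project's code: it uses plain Python lists and a brute-force cosine-similarity search in place of SentenceTransformers embeddings and a FAISS index, and the vectors and function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return [(doc_index, cosine_similarity)] for the k most similar documents."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(doc_vecs):
        d = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical); a real system would embed
# abstracts with a sentence-embedding model and search a vector index.
abstract_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.3, 0.1],
]
hits = top_k([1.0, 0.2, 0.0], abstract_vecs, k=2)
print(hits)
```

The indices returned by the search are what allow each generated claim to be linked back to a specific source article.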

Tools: Python, Streamlit, Hugging Face Transformers, SentenceTransformers, FAISS, PubMed API, Torch

*Currently working on deployment for easy accessibility.

GitHub ->

Register-Augmented LLM Finetuning

Collaborated with a team to develop a novel “register-augmentation” technique for transformer-based language models (BERT), improving question-answering performance by adding specialized “register” tokens during fine-tuning.

Implemented interpretability methods (Integrated Gradients, Layer-wise Relevance Propagation) to visualize how register tokens enhance the model's focus on task-relevant context, resulting in better F1 and Exact Match scores compared to standard fine-tuning.
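For readers unfamiliar with the two question-answering metrics named above, here is a minimal sketch assuming SQuAD-style definitions (answer normalization followed by string equality for Exact Match, and token overlap for F1). The project's actual evaluation script may differ; the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
```

Exact Match rewards only fully correct spans, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.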

GitHub -> Read Paper ->

Answer-Aware Question Generation System

Fine-tuned T5 and BART on QA datasets to generate high-quality questions, using BLEU, ROUGE, METEOR, and BERTScore to compare models and optimize performance.

GitHub -> Read Paper ->

Ann Arbor Water Production Forecasting

Using Voting Regressor for Time-Series Prediction

Built a predictive model to forecast daily water production in Ann Arbor based on 8 years of weather and usage data. Engineered temporal features and environmental indicators to capture seasonal demand fluctuations driven by rainfall and temperature.

Implemented and benchmarked multiple regression models, including Random Forest, Gradient Boosting, and Ridge, and combined them using a Voting Regressor for improved accuracy and stability.

Enabled data-driven water resource planning by predicting future demand and reconstructing sensor gaps, supporting the city’s sustainability goals.
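A voting regressor combines several fitted models by averaging their predictions. The pure-Python sketch below mimics only that averaging step, with two deliberately simple base models standing in for Random Forest, Gradient Boosting, and Ridge. The data, model choices, and function names are hypothetical and are not taken from the project.

```python
from statistics import mean


def fit_mean_model(xs, ys):
    """Baseline that always predicts the training mean."""
    avg = mean(ys)
    return lambda x: avg


def fit_linear_model(xs, ys):
    """One-feature ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def voting_predict(models, x, weights=None):
    """Weighted average of the base models' predictions (equal weights by default)."""
    weights = weights or [1.0] * len(models)
    return sum(w * m(x) for w, m in zip(weights, models)) / sum(weights)


# Toy data (hypothetical): daily temperature in °C vs. water production in arbitrary units.
temps = [10.0, 15.0, 20.0, 25.0, 30.0]
production = [12.0, 13.0, 14.0, 15.0, 16.0]
models = [fit_mean_model(temps, production), fit_linear_model(temps, production)]
prediction = voting_predict(models, 35.0)
print(prediction)
```

Averaging tends to smooth out the individual models' errors, which is the usual motivation for the stability gain mentioned above.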

Tools: Python, Scikit-learn, Voting Regressor, Pandas, Matplotlib, NumPy

GitHub ->

ThermDepth

Forecasting Subsurface Lake Temperatures for Environmental Intelligence

Predicted hourly temperatures at 10.5 m depth in Trout Lake (2018–2019) using multivariate time-series data from 2012–2018 across varying depths.

Engineered temporal and spatial features; benchmarked models including Random Forest, XGBoost, and LSTM for best performance.

Supported climate-aware ecological monitoring by reconstructing missing sensor data with high accuracy.

Tools: Python, Scikit-learn, XGBoost, LSTM, Pandas, Matplotlib

GitHub ->

PotholeSense

Predicting Pothole Severity in Chicago Using LightGBM for Smart Infrastructure Maintenance

Built a supervised machine learning pipeline using LightGBM to classify pothole severity across Chicago, utilizing 311 service request data, traffic volume, and road condition features. The model achieved 92% precision and 88% recall in identifying high-risk potholes, enabling data-driven prioritization of repair work.

Engineered predictive features including response lag, weather seasonality, and zip-code-based clustering. Applied robust preprocessing (missing-value imputation, label encoding, and outlier handling) to prepare the data for training.

Benchmarked LightGBM against Decision Trees, Random Forests, and XGBoost using stratified cross-validation and grid search. LightGBM consistently delivered the best performance with low inference time.
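As a reminder of what the precision and recall figures for this project measure, the sketch below computes both from true and predicted labels. The labels are a hypothetical toy example, not project data, and the function name is illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of the items flagged positive, how many truly are.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the truly positive items, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = high-risk pothole, 0 = not high-risk.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
print(precision_recall(y_true, y_pred))
```

For repair prioritization, high precision limits wasted crew dispatches, while high recall limits missed hazards.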

Tools: Python, LightGBM, Scikit-learn, XGBoost, Decision Trees, Random Forests, Feature Engineering, Hyperparameter Tuning, Cross-validation, Pandas, Geopandas, Matplotlib, Seaborn

GitHub ->

Articles

Latest posts on Machine Learning, NLP, and Generative AI

Skills

Technical strengths across ML, Programming, Data Engineering & Analytics

Machine Learning & Statistics

LLMs, NLP, LangChain, Deep Learning, Predictive Models, Decision Trees, Clustering, Regression, Experimentation, Hypothesis Testing, A/B Testing, Time Series Forecasting, Descriptive & Inferential Stats, Quantitative & Data Analysis.

Programming Languages

Python, SQL, NoSQL, R, SAS, MATLAB, Java

Libraries

PyTorch, Matplotlib, PySpark, PyCharm, TensorFlow, Pandas, Seaborn, NumPy, Scikit-learn, NLTK, SpaCy, Streamlit

Data Engineering & Platforms

Docker, Kubernetes, Jupyter, Git, MySQL, Tableau, Google Looker, AWS, Apache Airflow, Snowflake
