Real-Time Tweet Sentiment Analysis Pipeline

Overview

Implemented an end-to-end streaming analytics pipeline that analyzes tweet sentiment in real time using Apache Spark Structured Streaming and Delta Lake. The system processes Twitter data through a medallion architecture (Bronze→Silver→Gold) and classifies sentiment with a pre-trained BERT-based transformer model. Designed for large-scale streaming data, with fault tolerance, exactly-once processing, and real-time monitoring of stream health.

Problem Statement

Social media platforms generate massive volumes of unstructured text that must be analyzed in near real time for brand monitoring, trend detection, and public opinion tracking. Batch processing introduces too much latency for these use cases, and streaming solutions built without checkpointing and transactional storage lack fault tolerance and exactly-once processing guarantees. Processing 41,000+ tweet files while maintaining data quality and performance presents significant engineering challenges.

Approach

Built a multi-layer medallion architecture using Spark Structured Streaming. The Bronze layer ingests raw JSON tweets with schema validation, the Silver layer extracts mentions, cleans text, and converts timestamps, and the Gold layer applies ML inference via MLflow using a BERTweet model already fine-tuned for sentiment analysis (finiteautomata/bertweet-base-sentiment-analysis). Used Delta Lake for ACID transactions and time-travel capabilities. Configured dynamic partitioning scaled to cluster size, checkpoint-based fault recovery, and stream health monitoring. Optimized performance through memory management and automatic small-file compaction.
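
As a rough illustration of the Silver-layer step described above, the sketch below turns one raw tweet record into cleaned rows in plain Python. It is not the pipeline's actual code: the function name to_silver, the field names (text, created_at), the one-row-per-mention output, the rule that drops incomplete records, and the Twitter v1.1 timestamp layout are all assumptions made for the example. In the real pipeline the same logic would more likely be expressed as Spark column expressions.

```python
import re
from datetime import datetime, timezone
from typing import Dict, List

# Assumed Twitter v1.1 "created_at" layout, e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_TS_FORMAT = "%a %b %d %H:%M:%S %z %Y"

MENTION_RE = re.compile(r"@(\w{1,15})")  # handles are at most 15 word characters
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def to_silver(raw: Dict) -> List[Dict]:
    """Turn one raw (Bronze) tweet into zero or more Silver rows, one per mention.

    Hypothetical helper: field names and the drop rule are illustrative only.
    """
    text = raw.get("text") or ""
    created_at = raw.get("created_at")
    if not text or not created_at:
        return []  # assumed data-quality rule: skip incomplete records

    # Parse the string timestamp and normalize it to UTC.
    timestamp = datetime.strptime(created_at, TWITTER_TS_FORMAT).astimezone(timezone.utc)

    # Collect mentions before they are stripped from the text.
    mentions = MENTION_RE.findall(text)

    # Clean the text: remove URLs and mentions, then collapse whitespace.
    cleaned = URL_RE.sub(" ", text)
    cleaned = MENTION_RE.sub(" ", cleaned)
    cleaned = WHITESPACE_RE.sub(" ", cleaned).strip()

    # A tweet without mentions yields no rows under this assumed layout.
    return [
        {"mention": f"@{name}", "cleaned_text": cleaned, "timestamp": timestamp}
        for name in mentions
    ]
```

Emitting one row per mention keeps later per-user aggregation simple, at the cost of duplicating the cleaned text for tweets that mention several users.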

Results

Processed 41,000+ tweet files and achieved 82.1% weighted precision, 55.7% recall, and 66.3% F1-score in sentiment classification across the positive, negative, and neutral classes. The pipeline maintained stable throughput, with real-time metrics tracking input rate, processing latency, and resource utilization. Demonstrated exactly-once processing with automatic recovery from checkpoints. Processed the full dataset in ~70 minutes with fault-tolerant execution. Generated sentiment distribution analytics and temporal trend analysis for user mentions.
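
For readers unfamiliar with weighted metrics, the sketch below shows how support-weighted precision, recall, and F1 can be computed for a three-class problem. It assumes the figures above are support-weighted averages; the write-up does not say which library produced them, and the function name weighted_metrics and the toy labels are purely illustrative, not pipeline data. Two properties are worth noting: for single-label classification, weighted recall equals overall accuracy, and weighted F1 is the average of per-class F1 scores, so it need not equal the harmonic mean of weighted precision and weighted recall.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1 for single-label classification."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / total  # each class counts by its share of true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neu", "neg", "neu", "neu", "neu"]
metrics = weighted_metrics(y_true, y_pred)
```

In this toy case the classifier over-predicts the neutral class, which keeps precision high for the positive and negative classes while lowering their recall, the same general shape as a precision figure well above recall.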