This project aims to develop a natural language processing (NLP)-based classification system to automatically identify whether a tweet is related to a real disaster event.
It is a binary classification task, where each tweet must be labeled as:
- `1`: Disaster-related
- `0`: Not disaster-related
- Linguistic Ambiguity: Distinguishing literal disaster descriptions from metaphorical expressions (e.g., “on fire” can mean excitement rather than fire).
- Context Understanding: Identical words may convey different meanings in different contexts.
- Data Imbalance: Non-disaster tweets noticeably outnumber disaster-related ones (roughly 57% vs. 43% in the training data).
| Tweet | Label | Description |
|---|---|---|
| "I'm on fire tonight! Winning all my poker games!" | 0 | Metaphorical expression, not related to a disaster |
| "Forest fire near La Ronge Sask. Canada. Residents asked to evacuate immediately." | 1 | Describes a real disaster event |
| Filename | Description | Records |
|---|---|---|
| `train.csv` | Training data with tweet content and labels | 7,613 |
| `test.csv` | Test data without labels | 3,263 |
| `sample_submission.csv` | Sample submission format for Kaggle | - |
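For orientation, the files load directly with pandas; the paths below assume the CSVs sit in the working directory.

```python
import pandas as pd

# Load the competition files (paths assume the default working directory).
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.shape)  # expected: (7613, 5)
print(test_df.shape)   # expected: (3263, 4); no `target` column
print(train_df["target"].value_counts())
```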
| Column | Type | Description |
|---|---|---|
| `id` | Integer | Unique tweet identifier |
| `text` | String | Tweet content (main analysis target) |
| `location` | String | Tweet location (can be null) |
| `keyword` | String | Associated keyword (can be null) |
| `target` | Integer | Disaster label (0 or 1); only in training data |
- Imbalance Observed:
  - Disaster tweets: ~3,271 (43%)
  - Non-disaster tweets: ~4,342 (57%)
- Potential Impact: Class imbalance may bias the model toward the majority class.
- Solutions (see the sketch after this list):
  - Use `class_weight='balanced'` in the model
  - Apply oversampling techniques such as SMOTE
  - Prefer F1-score over accuracy for evaluation
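As a minimal sketch of the first and third suggestions, the snippet below trains a TF-IDF + logistic-regression baseline with `class_weight='balanced'` and reports F1. The pipeline is illustrative rather than the project's final model; column names follow the data description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["target"],
    test_size=0.2, stratify=train_df["target"], random_state=42,
)

vectorizer = TfidfVectorizer(max_features=10_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# class_weight='balanced' reweights each class inversely to its frequency,
# counteracting the 57/43 imbalance noted above.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_vec, y_train)

# Evaluate with F1 rather than accuracy, as recommended above.
print("validation F1:", f1_score(y_val, clf.predict(X_val_vec)))
```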
- Length Concentration: Most tweets fall within 135–140 characters
- Class Differences:
  - Disaster tweets are generally shorter and more concise
  - Non-disaster tweets tend to be longer with more modifiers
- Solutions (see the sketch after this list):
  - Normalize sequence lengths via `padding` and `truncation`
  - Consider tweet length as an additional feature
  - Set `max_length ≈ 140` to match tweet characteristics
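A minimal sketch of padding and truncation with a Hugging Face tokenizer; `bert-base-uncased` is an assumed checkpoint, and `max_length=140` follows the recommendation above (note the limit applies to subword tokens, not characters).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sample_tweets = [
    "Forest fire near La Ronge Sask. Canada",
    "I'm on fire tonight! Winning all my poker games!",
]

# Pad shorter tweets and truncate longer ones to a fixed length of 140.
encoded = tokenizer(
    sample_tweets,
    padding="max_length",
    truncation=True,
    max_length=140,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 140])
```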
- Common Words: “one”, “people”, “got”, “New”, “going”, etc.
- Disaster Keywords: “fire”, “emergency”, “dead” are strong indicators
- Noise Removal: tokens such as “https” (from URLs) and “amp” (from the HTML entity `&amp;`) should be cleaned
- Solutions (see the sketch after this list):
  - Use TF-IDF or a BERT tokenizer for enhanced text representation
  - Create a custom disaster keyword dictionary
  - Clean URLs and special characters during preprocessing
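One possible cleaning function covering the URL and “amp” noise; the exact regex rules are illustrative and would be tuned against the corpus.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, HTML entities such as &amp;, and punctuation noise."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = re.sub(r"&\w+;", " ", text)               # remove HTML entities ("amp", etc.)
    text = re.sub(r"[^A-Za-z0-9\s#@]", " ", text)    # keep hashtags and mentions
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Forest fire near La Ronge &amp; more: https://t.co/abc123"))
# -> "forest fire near la ronge more"
```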
- Missing Location Issue:
  - Around 2,500 tweets have unknown locations (roughly a third of the training data)
- Geographic Patterns:
  - Tweets with known locations show higher disaster-tweet ratios in India, Mumbai, and the USA
- Solutions (see the sketch after this list):
  - Lower the weight of location features in the model
  - Consider indirect geographic indicators where possible
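A small, hedged example of one way to keep the rows while signalling missingness: treat unknown location as its own category so the model can learn to discount it.

```python
import pandas as pd

train_df = pd.read_csv("train.csv")

# Keep rows with missing locations; mark them as an explicit category.
train_df["location"] = train_df["location"].fillna("unknown")
print(train_df["location"].value_counts().head())
```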
- A bidirectional LSTM captures sequential dependencies and context from both directions.
- It serves as a strong baseline for tweet-level textual data (see the sketch below).
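A minimal Keras sketch of such a baseline; `VOCAB_SIZE`, the embedding size, and the LSTM width are assumed hyperparameters, and `MAX_LEN` reuses the 140 figure from the length analysis.

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size after tokenization
MAX_LEN = 140        # matches the max_length choice above

# Minimal bidirectional-LSTM baseline for binary tweet classification.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```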
- BERT (Bidirectional Encoder Representations from Transformers) excels at contextual understanding.
- Especially effective on informal language patterns commonly seen in tweets.
- Widely adopted in Kaggle solutions.
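A minimal sketch of loading BERT for this binary task with the `transformers` library; the freshly initialized classification head still requires fine-tuning on train.csv before its outputs are meaningful.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = not disaster, 1 = disaster
)

inputs = tokenizer(
    "Forest fire near La Ronge Sask. Canada",
    return_tensors="pt", truncation=True, max_length=140,
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # probabilities from the not-yet-fine-tuned head
```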
- To further improve comprehension and generalization, DeepSeek LLM is considered.
- It promises human-like understanding while remaining simple to implement and integrate.
- Combine DeBERTa and RoBERTa for ensemble learning:
- DeBERTa improves attention and position encoding
- RoBERTa is an optimized BERT variant with better training strategy
- Goal: Leverage complementary strengths for more stable and accurate predictions, which is especially beneficial in competition settings (a soft-voting sketch follows).
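A sketch of simple soft voting over the two models' positive-class probabilities; the arrays below are placeholder values standing in for fine-tuned DeBERTa and RoBERTa outputs.

```python
import numpy as np

# Placeholder per-tweet P(disaster) from each fine-tuned model.
p_deberta = np.array([0.91, 0.12, 0.48])
p_roberta = np.array([0.85, 0.20, 0.61])

# Soft voting: average the probabilities, then threshold at 0.5.
p_ensemble = (p_deberta + p_roberta) / 2
predictions = (p_ensemble >= 0.5).astype(int)
print(predictions)  # [1 0 1]
```

Weighted averaging or stacking are natural extensions if one model consistently outperforms the other.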
This project is based on the Kaggle challenge: [Real or Not? NLP with Disaster Tweets].
It focuses on text mining, model experimentation, and multi-model integration for better performance.