This project aims to develop a natural language processing (NLP)-based classification system to automatically identify whether a tweet is related to a real disaster event.
It is a binary classification task, where each tweet must be labeled as:
- `1`: Disaster-related
- `0`: Not disaster-related
- Linguistic Ambiguity: Distinguishing literal disaster descriptions from metaphorical expressions (e.g., “on fire” can mean excitement rather than fire).
- Context Understanding: Identical words may convey different meanings in different contexts.
- Data Imbalance: Non-disaster tweets noticeably outnumber disaster-related ones (roughly 57% vs. 43% in the training data).
| Tweet | Label | Description |
|---|---|---|
| "I'm on fire tonight! Winning all my poker games!" | 0 | Metaphorical expression, not related to a disaster |
| "Forest fire near La Ronge Sask. Canada. Residents asked to evacuate immediately." | 1 | Describes a real disaster event |
| Filename | Description | Records |
|---|---|---|
| `train.csv` | Training data with tweet content and labels | 7,613 |
| `test.csv` | Test data without labels | 3,263 |
| `sample_submission.csv` | Sample submission format for Kaggle | - |
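For orientation, the files load directly with pandas; the paths below assume the CSVs sit in the working directory.

```python
import pandas as pd

# Load the competition files (paths assume the default working directory).
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(train_df.shape)  # expected: (7613, 5)
print(test_df.shape)   # expected: (3263, 4); no `target` column
print(train_df["target"].value_counts())
```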
| Column | Type | Description |
|---|---|---|
| `id` | Integer | Unique tweet identifier |
| `text` | String | Tweet content (main analysis target) |
| `location` | String | Tweet location (can be null) |
| `keyword` | String | Associated keyword (can be null) |
| `target` | Integer | Disaster label (0 or 1); only in training data |
- Imbalance Observed:
  - Disaster tweets: ~3,271 (43%)
  - Non-disaster tweets: ~4,342 (57%)
- Potential Impact: Class imbalance may bias the model toward the majority class.
- Solutions (see the sketch after this list):
  - Use `class_weight='balanced'` in the model
  - Apply oversampling techniques such as SMOTE
  - Prefer F1-score over accuracy for evaluation
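As a minimal sketch of the first and third suggestions, the snippet below trains a TF-IDF + logistic-regression baseline with `class_weight='balanced'` and reports F1. The pipeline is illustrative rather than the project's final model; column names follow the data description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    train_df["text"], train_df["target"],
    test_size=0.2, stratify=train_df["target"], random_state=42,
)

vectorizer = TfidfVectorizer(max_features=10_000)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

# class_weight='balanced' reweights each class inversely to its frequency,
# counteracting the 57/43 imbalance noted above.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train_vec, y_train)

# Evaluate with F1 rather than accuracy, as recommended above.
print("validation F1:", f1_score(y_val, clf.predict(X_val_vec)))
```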
- Length Concentration: Most tweets fall within 135–140 characters
- Class Differences:
  - Disaster tweets are generally shorter and more concise
  - Non-disaster tweets tend to be longer with more modifiers
- Solutions (see the sketch after this list):
  - Normalize sequence lengths via `padding` and `truncation`
  - Consider tweet length as an additional feature
  - Set `max_length ≈ 140` to match tweet characteristics
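A minimal sketch of padding and truncation with a Hugging Face tokenizer; `bert-base-uncased` is an assumed checkpoint, and `max_length=140` follows the recommendation above (note the limit applies to subword tokens, not characters).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sample_tweets = [
    "Forest fire near La Ronge Sask. Canada",
    "I'm on fire tonight! Winning all my poker games!",
]

# Pad shorter tweets and truncate longer ones to a fixed length of 140.
encoded = tokenizer(
    sample_tweets,
    padding="max_length",
    truncation=True,
    max_length=140,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 140])
```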
- Common Words: “one”, “people”, “got”, “New”, “going”, etc.
- Disaster Keywords: “fire”, “emergency”, “dead” are strong indicators
- Noise Removal: tokens such as “https” (from URLs) and “amp” (from the HTML entity `&amp;`) should be cleaned
- Solutions (see the sketch after this list):
  - Use TF-IDF or a BERT tokenizer for enhanced text representation
  - Create a custom disaster keyword dictionary
  - Clean URLs and special characters during preprocessing
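One possible cleaning function covering the URL and “amp” noise; the exact regex rules are illustrative and would be tuned against the corpus.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, HTML entities such as &amp;, and punctuation noise."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = re.sub(r"&\w+;", " ", text)               # remove HTML entities ("amp", etc.)
    text = re.sub(r"[^A-Za-z0-9\s#@]", " ", text)    # keep hashtags and mentions
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Forest fire near La Ronge &amp; more: https://t.co/abc123"))
# -> "forest fire near la ronge more"
```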
- Missing Location Issue:
  - Around 2,500 tweets have unknown locations (roughly a third of the training data)
- Geographic Patterns:
  - Tweets with known locations show higher disaster-tweet ratios in India, Mumbai, and the USA
- Solutions (see the sketch after this list):
  - Lower the weight of location features in the model
  - Consider indirect geographic indicators where possible
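A small, hedged example of one way to keep the rows while signalling missingness: treat unknown location as its own category so the model can learn to discount it.

```python
import pandas as pd

train_df = pd.read_csv("train.csv")

# Keep rows with missing locations; mark them as an explicit category.
train_df["location"] = train_df["location"].fillna("unknown")
print(train_df["location"].value_counts().head())
```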
- A bidirectional LSTM captures sequential dependencies and context from both directions.
- It serves as a strong baseline for tweet-level textual data (see the sketch below).
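A minimal Keras sketch of such a baseline; `VOCAB_SIZE`, the embedding size, and the LSTM width are assumed hyperparameters, and `MAX_LEN` reuses the 140 figure from the length analysis.

```python
import tensorflow as tf

VOCAB_SIZE = 20_000  # assumed vocabulary size after tokenization
MAX_LEN = 140        # matches the max_length choice above

# Minimal bidirectional-LSTM baseline for binary tweet classification.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```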
- BERT (Bidirectional Encoder Representations from Transformers) excels at contextual understanding.
- Especially effective on informal language patterns commonly seen in tweets.
- Widely adopted in Kaggle solutions.
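A minimal sketch of loading BERT for this binary task with the `transformers` library; the freshly initialized classification head still requires fine-tuning on train.csv before its outputs are meaningful.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = not disaster, 1 = disaster
)

inputs = tokenizer(
    "Forest fire near La Ronge Sask. Canada",
    return_tensors="pt", truncation=True, max_length=140,
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # probabilities from the not-yet-fine-tuned head
```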
- To further improve comprehension and generalization, DeepSeek LLM is considered.
- It promises human-like understanding while remaining simple to implement and integrate.
- Combine DeBERTa and RoBERTa for ensemble learning:
- DeBERTa improves attention and position encoding
- RoBERTa is an optimized BERT variant with better training strategy
- Goal: Leverage complementary strengths for more stable and accurate predictions, which is especially beneficial in competition settings (a soft-voting sketch follows).
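A sketch of simple soft voting over the two models' positive-class probabilities; the arrays below are placeholder values standing in for fine-tuned DeBERTa and RoBERTa outputs.

```python
import numpy as np

# Placeholder per-tweet P(disaster) from each fine-tuned model.
p_deberta = np.array([0.91, 0.12, 0.48])
p_roberta = np.array([0.85, 0.20, 0.61])

# Soft voting: average the probabilities, then threshold at 0.5.
p_ensemble = (p_deberta + p_roberta) / 2
predictions = (p_ensemble >= 0.5).astype(int)
print(predictions)  # [1 0 1]
```

Weighted averaging or stacking are natural extensions if one model consistently outperforms the other.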
This project is based on the Kaggle challenge: [Real or Not? NLP with Disaster Tweets].
It focuses on text mining, model experimentation, and multi-model integration for better performance.