Code File Analysis

bussat-claire

STAGE 1
1. Exploratory data analysis, preprocessing and cleaning

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11188 entries, 2 to 11095
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   company     11188 non-null  object
 1   content     11187 non-null  object
 2   datatype    11006 non-null  object
 3   date        11188 non-null  object
 4   domain      11096 non-null  object
 5   esg_topics  11188 non-null  object
 6   internal    11188 non-null  int64
 7   symbol      11187 non-null  object
 8   title       11188 non-null  object
 9   url         11096 non-null  object
dtypes: int64(1), object(9)
memory usage: 961.5+ KB
None
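The overview above can be reproduced with a short load step; a minimal sketch, assuming the raw data sits in a CSV file named esg_documents.csv (the actual file name is not given here):

import pandas as pd

df = pd.read_csv("esg_documents.csv")
print(df.info())  # df.info() prints the overview and returns None, hence the trailing 'None'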

2. Compare the number of ESG documents per company.


We also compute the average length of the content per document.
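Both comparisons can be done with simple groupbys; a minimal sketch, assuming the dataframe is named df as above:

docs_per_company = df["company"].value_counts()     # ESG documents per company
df["content_length"] = df["content"].str.len()      # content length in characters
avg_length = df.groupby("company")["content_length"].mean()
print(docs_per_company)
print(avg_length.sort_values(ascending=False))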

3. Cleaning
We can see that we have a lot of numbers, dates, symbols, punctuation, and website addresses. We are going to remove all of these and lemmatize our data to make it ready for analysis.

There are still a lot of stopwords and single letters lost in the text, which we are also going to remove while tokenizing the data at the same time.
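A possible shape of this cleaning step with NLTK; the regex patterns and column names are assumptions, not the notebook's exact code:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_and_tokenize(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # drop website addresses
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # drop numbers, dates, symbols, punctuation
    tokens = nltk.word_tokenize(text)
    # remove stopwords and stray single letters, lemmatize the rest
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words and len(t) > 1]

df["tokens"] = df["content"].fillna("").apply(clean_and_tokenize)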

4. General Analysis
Now we can check the average length of the content by datatype.
We can see that ESG comes last in terms of average content length, while annual reports come first.

5. Visualization
Now we will visualize which words appear most frequently.
We see that some of the most frequent words are verbs or citation text like 'et al'. To obtain something more relevant and precise we will work with TF-IDF vectorization, which will also be useful for further analysis.
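A minimal TF-IDF sketch with scikit-learn; max_features and the use of the built-in English stopword list are assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
tfidf = vectorizer.fit_transform(df["content"].fillna(""))

# terms with the highest average TF-IDF weight across the corpus
weights = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
print([terms[i] for i in weights.argsort()[::-1][:20]])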

6. Time Series
To check the topic distribution over time, we will first have a look at all the topics that we have in the data.
Then we are going to convert the data in the date column into date format.
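The date conversion in minimal form; errors='coerce' is an assumption to absorb malformed dates:

df["date"] = pd.to_datetime(df["date"], errors="coerce")
print(df["date"].min(), df["date"].max())  # covered timeframe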
STAGE 2
1. Data Annotation
Initialisation: !pip install transformers

2. Sample Annotation
We only tokenized words in stage 1, but now we are going to use sentiment analysis on sentences, so we tokenize our documents again, this time by sentence.

We will perform manual sentiment annotation on a sample of 500 random sentences from our dataset and extract it as a CSV.
We now have a sample of 500 manually annotated sentences and want to evaluate three LLMs on this data.
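A sketch of the sentence split and sampling step; the random seed and output file name are assumptions:

from nltk.tokenize import sent_tokenize

sentences = df["content"].fillna("").apply(sent_tokenize).explode().dropna()
sample = sentences.sample(n=500, random_state=42)
sample.to_frame(name="sentence").to_csv("annotation_sample.csv", index=False)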

3. Preprocessing
RoBERTa base for sentiment analysis
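A minimal way to run a RoBERTa-base sentiment model with transformers; the exact checkpoint is not named here, so cardiffnlp/twitter-roberta-base-sentiment is an assumption:

from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment")
print(sentiment("The company exceeded its emission reduction targets."))
# e.g. [{'label': 'LABEL_2', 'score': ...}], where LABEL_2 = positive for this checkpoint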

Now we want to compare the results from the manually annotated data with the annotations from the model.
We see that in general both annotations have mostly neutral sentences, then positive sentences, and fewer negative sentences. However, the model produces far more neutral labels than the manual annotations, and therefore fewer positive and far fewer negative ones.

4. DistilBERT finetuned
No positive sentiment is identified in the 300 sentences used, which is not realistic.

5. Application to the entire dataset


Based on these results we will use the RoBERTa model.

STAGE 3-4
1. Split the labeled data: 70% train, 15% dev, 15% test (see the sketch after this list).
2. Train a sentiment model that gives scores from 0 (negative) to 1 (positive).
3. Compare average sentiment of internal vs. external texts for each company.
4. Sort companies by the sentiment gap.
5. Manually check if top-gap companies were involved in greenwashing.
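A sketch of step 1, using two chained train_test_split calls to obtain the 70/15/15 split; labeled_df and the random seed are placeholders:

from sklearn.model_selection import train_test_split

train, rest = train_test_split(labeled_df, test_size=0.30, random_state=42)
dev, test = train_test_split(rest, test_size=0.50, random_state=42)  # 15% / 15%
print(len(train), len(dev), len(test))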

1. Setup

2. Train-Test Split
3. Text Vectorization
- With BoW
- With TF-IDF
4. Training of Machine Learning Models
1st algorithm: SVM
2nd algorithm: Decision Tree
3rd algorithm: Naive Bayes
4th algorithm: Logistic Regression
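A sketch of training the four algorithms on the vectorized text; X_train, y_train, X_dev, y_dev are placeholders from the split and vectorization steps, and the hyperparameters are assumptions:

from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

models = {
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_dev, y_dev))  # dev-set accuracy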

5. ML Models Evaluation


To evaluate each model's performance, there are several common metrics in use:
- Precision
- Recall
- F-score
- Accuracy
- Confusion matrix
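These metrics can all be computed with scikit-learn; y_dev and the fitted models dict are placeholders from the sketch above:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = models["Logistic Regression"].predict(X_dev)
print(classification_report(y_dev, y_pred))  # precision, recall, F-score, accuracy
print(confusion_matrix(y_dev, y_pred))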

6. Test with a Pretrained Model


-Evaluation before finetuning
-Finetuning the pretrained model

7. Annotating the Full Dataset


Having trained our models and selected logistic regression as the best one, the aim is now to apply the model to the entire dataset to have it fully annotated for sentiment.
- With Logistic Regression
- With the TextBlob library (backup)
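The TextBlob backup in minimal form: polarity is a float in [-1, 1] that can be thresholded into the three classes; the thresholds are assumptions:

from textblob import TextBlob

def textblob_sentiment(sentence):
    polarity = TextBlob(sentence).sentiment.polarity
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

print(textblob_sentiment("The company significantly reduced its emissions."))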

8. Sentiment Analysis
- EDA
Participants then compare the average sentiment of internal vs. external data about a company. They sort the companies based on the difference between internal and external sentiment and do manual follow-up research to see if the companies with the biggest gap have been explicitly involved in greenwashing during the considered timeframe.
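A sketch of the gap computation, assuming a numeric 'sentiment' column from the annotation step and the 'internal' flag (1 = internal, 0 = external) from stage 1:

pivot = df.groupby(["company", "internal"])["sentiment"].mean().unstack()
pivot.columns = ["external", "internal"]      # columns 0 and 1 from the flag
pivot["gap"] = pivot["internal"] - pivot["external"]
print(pivot.sort_values("gap", ascending=False).head(10))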
- Comparison of internal vs. external data by company

- Companies with the biggest difference between internal / external data


- Follow-up on the results
Both of these companies are, based on our analysis, the top 2 with the largest differences between the weighted average sentiments of internal and external documents. We therefore went looking for scandals associated with their names and found the following: Beiersdorf AG was accused in 2002 of falsely proclaiming CO2-emission neutrality for the production of its products. This confirms the gap we found in our data between internal and external documents. Deutsche Bank AG had to let go of one of its executives in 2022 following a scandal related to allegedly sustainable funds that did not respect the promised sustainability criteria. This also shows that if a company publishes documents full of promises, the false pretenses can ultimately show up in the data.
Arian Contessotto
STAGE 1
1. Prerequisites and Load
Import Packages and Make Downloads
1.2 Load Data

2. Data Preprocessing
Data Cleaning
The data cleansing includes the following transformations:
• The columns 'domain' and 'url' are removed from the dataframe as they contain many null values and do not provide important information.
• The columns are reordered.
• The name of Munich Re is changed to Munich RE (looks nicer later on in graphical representations).
• The dataframe is sorted by the column 'company'.
• There are duplicates that need to be removed from the dataframe. 6 duplicates are full duplicates. For these, the first entry is kept
(keep='first'). There are also 600 duplicates based on the 'content' column. This is due to external reports that contain information about
several companies within one article. This means, for example, that a document with the same content is stored once for Allianz and once
for BMW. These contents would interfere with the analysis, so they must be completely removed from the dataframe (keep=False).
• There is a null value for the column 'content'. This row therefore provides no meaningful content for this project and is deleted.
• For the companies Fresenius (6) and Hannover R AG (2) only very few documents exist. Hannover R AG also has no external documents
in the dataframe. These numbers are considered as too small to lead to a meaningful analysis. Therefore, all lines concerning these two
companies are removed from the dataframe.
2.2 Text Preprocessing
The following text preprocessing steps have been considered. An explanation is given as to
why they are or are not applied:
• Language detection and removal:
• Lowercase:
• Expand contractions:
• Remove URL and email:
• Remove punctuation:
• Removal of numbers:
• Removal of emojis and emoticons:
• Spelling correction:
• Word tokenisation:
• Sentence tokenisation:
• Lemmatisation and stemming:

3. Exploratory Data Analysis


3.1.1 Internal / External Reports
3.1.2 Number of Reports by Company (The most reports are available for Adidas AG. The fewest reports are available for Munich RE.)
3.1.3 Number of Reports by Industry (Most reports come from the pharmaceutical and automotive industries.)
3.1.4 Number of Reports by Datatype (Business, general and tech reports.)
3.1.5 Number of ESG Topics (356) (Social and environment.)

3.2 Exploratory Text Analysis (In a first step, two basic features are calculated: length of content and polarity.)
3.2.1 Wordclouds (In order to get a feeling of which terms can be significant in which industry, wordclouds are created for all industries.)
3.2.2 Length of Reports (Most reports are between 400 and 499 words long. External reports tend to be short, internal reports long.)
3.2.3 N-Gramming
3.2.4 Polarity

4. TF-IDF Analysis (indicates how important a word is in a document or corpus of documents)

5. Time Series Analysis (comparing the occurrence of ESG issues over time)

STAGE 2
1. Manual Text Annotation
(For manual annotation, a classical sentiment classification approach was applied. That is, sentences were primarily classified according to negative (level = 0, label = negative), neutral (level = 0.5, label = neutral), and positive (level = 1, label = positive) meaning. For example:
• The strategy of the last years was very successful => Positive.
• The strategy of the last years was alright => Neutral.
• The strategy of the last years turned out to be disadvantageous for the company => Negative.
Where possible and useful, an attempt was made to classify sentiment in relation to ESG topics. However, this was often not possible at sentence level outside context, so a classical sentiment classification approach was considered the best approach. Classification in terms of greenwashing was also deemed impractical, as no information is available on this. Thus, it cannot be seriously assessed whether positive or negative sentences in relation to ESG topics are true or whether greenwashing is present.)

2. Application of Pre-Trained LLMs for Annotation


(Five different LLMs are applied to the annotated dataset. Four of them are applied with a zero-shot strategy. For one model, a few-shot prompting strategy was used.)
2.1 Zero-Shot-Classification
The following models were used for the zero-shot classification:
• distilbert-base-uncased-finetuned-sst-2-english:
• siebert/sentiment-roberta-large-english:
• ahmedrachid/FinancialBERT-Sentiment-Analysis:
• facebook/bart-large-mnli
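For facebook/bart-large-mnli, the zero-shot classification pipeline of transformers can be used directly; the other three checkpoints are ordinary sentiment pipelines applied without finetuning. A minimal sketch:

from transformers import pipeline

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot("The strategy of the last years was very successful",
                   candidate_labels=["negative", "neutral", "positive"])
print(result["labels"][0], result["scores"][0])  # top label and its score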

2.2 Few-Shot-Classification with GPT-3


(For the few-shot classification, the GPT-3 language model was chosen because it is a question-answering model that allows the input of contextual information and can be used for sentiment classification, among other things.)
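A sketch of such a few-shot prompt against the legacy GPT-3 Completions API (openai<1.0); the model name, prompt wording and example sentences are assumptions:

import openai  # legacy SDK, openai<1.0

prompt = (
    "Classify each sentence as negative, neutral or positive.\n"
    "Sentence: The strategy of the last years was very successful. Sentiment: positive\n"
    "Sentence: The strategy of the last years was alright. Sentiment: neutral\n"
    "Sentence: The strategy turned out to be disadvantageous. Sentiment: negative\n"
    "Sentence: Emission targets were missed again this year. Sentiment:"
)
response = openai.Completion.create(model="text-davinci-003", prompt=prompt,
                                    max_tokens=2, temperature=0)
print(response["choices"][0]["text"].strip())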
3. Comparison and Evaluation of LLMs
(The evaluation of the different LLMs is done by comparing the distributions of the sentiment levels (0, 0.5 and 1). Furthermore, the sum of the absolute deviations between the LLM annotations and the manual annotations is computed and compared.)
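The deviation measure in minimal form, assuming a dataframe eval_df with columns 'manual' and 'model' holding the levels 0, 0.5 and 1:

total_abs_dev = (eval_df["manual"] - eval_df["model"]).abs().sum()
print(total_abs_dev)  # lower = closer to the manual annotation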
Manual annotation had mostly positive sentences.
BERT, RoBERTa, and BartLarge also predicted mostly positive sentiments.
FinancialBERT and GPT-3 predicted mostly neutral sentiments.
BERT and GPT-3 gave the most reasonable sentiment distributions (3 examples each).
BERT’s score range (0 to 1) is similar to manual annotation.
GPT-3 gave the most balanced sentiment distribution overall.
GPT-3 (with 3 example sentences) was the best performing model.
But GPT-3 is not free and using it fully would be expensive.

So, BERT (DistilBERT version) was chosen instead, as the second-best and free option.

5. Dataset Annotation
(Finally, each sentence token in the entire dataset is annotated using the BERT model as
described above.)

STAGE 3
1. Import Packages & Downloads

2. Model Finetuning
The evaluation for the model is based on the following conceptual approach:
1. Select multiple pretrained (Huggingface) models, based on previous stages
2. Train the selected models on a subset of the single sentences to keep the training time short
3. Compare the training outcomes of the different models on the subset and select the best
model

2.1 Finetune Model 1: distilbert-base-uncased
2.2 Finetune Model 2: roberta-base
2.3 Finetune Model 3: xlnet-base-cased
2.4 Finetune Model 4: flan-t5-base (not working correctly)
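A minimal finetuning sketch for the first candidate using the Trainer API; the dataset objects, column names and hyperparameters are assumptions. num_labels=1 yields a regression head, matching the 0-to-1 sentiment scores and the MSE/MAE/R2 evaluation below:

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length")

# hypothetical Hugging Face Datasets with a float 'label' column in [0, 1]
train_ds = train_dataset.map(tokenize, batched=True)
eval_ds = dev_dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()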

3. Model Evaluation
3.1 Finetuning/Training Metrics
3.2 Inference/Test Metrics
(In general, all models demonstrate poor performance according to MSE, MAE and R2 on completely unseen, new data. Surprisingly, XLNet performs best of the three models on completely new data.)
(According to the metrics from the finetuning, we expect the best results from a RoBERTa model, even though the model did not show good inference performance. Therefore, RoBERTa will be finetuned on the complete sentence dataset.)
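The test metrics named above, computed with scikit-learn; y_test and preds are placeholders for the held-out labels and model predictions:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

print("MSE:", mean_squared_error(y_test, preds))
print("MAE:", mean_absolute_error(y_test, preds))
print("R2:", r2_score(y_test, preds))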

4. Full Training of selected Model

5. Evaluation of fully trained Model & Sentiment Prediction

6. Compare internal vs. external


(The finetuned model is now used to predict the sentiment of all sentences in the documents.)
(We have run the full RoBERTa training with two different stage 2 outputs, i.e., different class/sentiment distributions:
• the initial full training with a quite imbalanced training dataset,
• a second full training with a more balanced training dataset.
Both datasets were slightly adjusted in their discrete class distributions before training. Both finetuned RoBERTa models were used to perform a sentiment analysis on all sentences.)

8.1 Comparison of internal/external Sentiments on Company Level

(Finally, we compare the internal and external sentiment scores on a company level. We
display the results for both trained classifiers.
Classifier 1 is the finetuned RoBERTa model on dataset 1 (imbalanced dataset). Classifier 2
is the finetuned RoBERTa model on dataset 2 (more balanced dataset).)

(We can conclude from the sentiment analysis that neither classifier 1 nor classifier 2 detected any significant greenwashing patterns. Only for the company Qiagen is there possibly greenwashing, based on the result of classifier 2. However, since the classifiers achieved rather poor results in the model evaluation, this analysis should be treated with caution.)

STAGE-4
Alignment with Sustainable Development Goals

1. Build Embeddings

GPU NEEDED

2. SDG Alignment of DAX Companies

(We model SDG alignment as the similarity between the company-related texts and the SDG descriptions. In this section, we first define the similarity function using standard cosine similarity. We then perform some alignment analysis, including visualizations and interpretations. All analyses are executed at the company, sector and industry level.)
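A sketch of the alignment scoring with sentence-transformers; the checkpoint all-MiniLM-L6-v2, the SDG snippet texts and the company_texts list are assumptions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sdg_texts = {"SDG 7": "Affordable and clean energy for all ...",
             "SDG 13": "Urgent action to combat climate change ..."}  # hypothetical snippets
sdg_emb = model.encode(list(sdg_texts.values()), convert_to_tensor=True)

# average the embeddings of all texts about one company, then score per SDG
company_emb = model.encode(company_texts, convert_to_tensor=True).mean(dim=0)
scores = util.cos_sim(company_emb, sdg_emb)[0]  # cosine similarity per SDG
for name, score in zip(sdg_texts, scores):
    print(name, float(score))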

3.1 Most Relevant SDGs for DAX Companies - Overview

3.2 Most Relevant SDGs for DAX Companies - Company Level

(In this analysis, we focus on the most important SDGs at the company level. First, we take a
closer look at a specific company defined by the variable COMPANY. We find the 'internal'
and 'external' embeddings for this company, average them, and measure their similarity to
each of the SDGs. We then aggregate and summarize the results for all companies by
displaying heatmaps.)
3.3 Most Relevant SDGs for DAX Companies - Sector Level

3.4 Most Relevant SDGs for DAX Companies - Industry Level
