Operational Efficiency Report

Uploaded by

Elalmi anas
Client Visit Report

Enhancing Operational Efficiency through Collaborative Problem-Solving

Date: 09/08/2023
Prepared by: Anas EL ALMI & Mounia RAOUI
Table of contents
Foreword
Context of GIO's visit
WIFI
TOOLS
HPNA
ZTP
Syslog Server
DNA Center
Escalation from L1 to L2, then L3 - Process & Automation
NNMi
LAN Light/Light+
ServiceDesk Issues
Foreword about GIO's visit to SFR/Intelcia teams in Morocco
We are pleased to present this report detailing the results of GIO's recent visit to the
Moroccan teams. The purpose of the visit was to engage with your organization's operational
teams and address the operational challenges affecting your business processes.
We are grateful for the opportunity to engage directly with your operational teams. The
session gave us valuable insight into the day-to-day challenges your organization faces. The
client's commitment to collaboration and innovative solutions was evident throughout the visit.
During the visit, we had the honor of taking part in a creative workshop where we came up
with ideas together and explored possibilities. The goal was to identify practical ways to
eliminate the operational constraints that hinder your business goals. We are confident
that the collective expertise and commitment of our two teams will contribute to effective
action plans and to their successful implementation.

Context of GIO's visit to SFR/Intelcia teams in Morocco
This report aims to capture the essence of the visit, focusing on three main areas:

1- Context of the Visit:
• A brief overview of the purpose and objectives of GIO's visit.
• The background of our engagement and previous interactions with your teams.
• The significance of collaborating with your operational teams for problem solving.

2- Defining Operational Challenges:
• A comprehensive analysis of the operational issues raised during the discussions.
• An in-depth understanding of the root causes of these challenges.
• The potential impact of these challenges on your organization's overall performance.

3- Action Plans for Improvement:
• A detailed outline of the action plans developed during the collaborative workshop.
• Clear and achievable milestones to track progress and measure success.
• Roles, responsibilities, and timelines for each action item.

We believe that this report will serve as a valuable reference for both our organizations as
we work towards enhancing operational efficiency and achieving mutual growth. Our
collective commitment to addressing these challenges demonstrates the strong partnership
between our companies and our shared vision for success.

Once again, we extend our heartfelt appreciation for hosting us and providing the
opportunity to work closely with your dedicated teams. We look forward to executing the
action plans and witnessing the positive impact on your organization.

Should you have any questions or require further information, please do not hesitate to
reach out to us. We remain fully committed to ensuring the successful implementation of
the proposed solutions.
WIFI
Context:
To improve wireless access quality and security, GIO launched an AP refresh project, WIND
(Wireless Network for Digital). It aims to refresh the global Wifi infrastructure and
secure wireless access through a NAC policy server for the EMEA, AMEI, NAM, LATAM and APAC
areas, for both the AL and Airgas entities. The project gathers 2 main lots:

1. Wind 1: Central infrastructure implementation (WLC, ISE), deployed on 3 DCs and tested with
pilot sites. It included the activation of the new Offices SSID.

2. Wind 2: Full or partial refresh for sites where at least 1 incompatible AP has to be
replaced, and logical migrations for sites whose APs are all compatible.

3. Wind 2: Full or partial refresh for sites where at least 1 EOS AP has to be replaced.

Wind 1: There are 970 sites for the logical migration, with 3600 APs to be migrated.
Wind 2: There are 3920 APs (1672 sites) to be replaced and configured in the current WLC.

The target architecture for WIND 1 is based on Cisco solutions.

The target architecture for WIND 2 is based on Cisco solutions, with Aruba solutions as an option.

Problem:
Following the IOS upgrade of the WIND WLCs, we started receiving complaints from sites,
especially in Europe and then in APAC, about users being disconnected while roaming
with OKC. When a user roams, the user sometimes completely loses access to the network
and has to disable and re-enable the Wifi card to recover Wifi access.

Impact: Loss of the Wifi connection for 5 minutes, the time needed to deactivate
and reactivate the SSIDs.

Workarounds:
1st workaround: OKC (Opportunistic Key Caching) was deactivated on the Corporate SSIDs
hosted on the Wifi controllers.
# The BIS was not at all satisfied with this workaround: the objective of the
upgrade was to benefit from the fast roaming (802.11r) functionality, whereas the
result was to disable OKC. The customer feels that this is a regression.
2nd workaround: after changing the IPv6 mgmt and reconfiguring the mobility group domain name
and the WLAN session timeout to 86400 s (without WLAN shut / no shut), roaming in
802.11r now works.
# The issue was that the initial association did not work in FT/802.11r mode because
the PMK-R0 and PMK-R1 KHIDs were set to 0s in the Association Response and no M3 was sent
to the client.

3rd workaround: a script was created that automates the execution of the 2nd
workaround every day at 11 PM for the EMEA area.
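
The daily 11 PM trigger of the 3rd workaround can be illustrated with a small scheduling helper. This is only a sketch: the function name `next_run` is hypothetical, the report does not describe how the real script is scheduled, and the helper assumes that `now` is expressed in the local time of the area whose WLCs are targeted.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, run_at: time = time(23, 0)) -> datetime:
    """Return the next daily execution time (11 PM by default) after `now`.

    Assumption: `now` is already in the local time of the target area.
    """
    candidate = datetime.combine(now.date(), run_at, tzinfo=now.tzinfo)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime(2023, 8, 9, 22, 0)))   # same day, 23:00
    print(next_run(datetime(2023, 8, 9, 23, 30)))  # next day, 23:00
```

Calling the helper once per area, each with its own local clock, is one possible way to keep the 11 PM slot outside working hours everywhere.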

Improvement suggestions

1. Ask the solution team to automate the script execution on the APAC and US WLCs as well.
# TO DO: owner GIO + SFR
2. Have the operations teams use DNA Center to quickly detect the roaming
problem and inform the customer that it is a known problem that will be
solved at 11 PM by the execution of the script. # Done
3. Recruit a Wifi expert to help better diagnose, audit and improve the Wifi
perimeter. # In progress
4. Create a test lab (4G router + AP + AL PC) in Morocco to reproduce Wifi incidents
and carry out tests without impacting production.
TOOLS
Tools Integration Diagram
Context and problem
Following a change that adds or removes equipment, the N2 team must update the tools (SNOW,
LiveNX, DNA, IPAM, NNMi, and HPNA).

Since the update is manual, there is a margin for human error, which can result in
incorrect equipment information, unsupervised equipment, unsaved configurations, etc.

The manual update also wastes N2 engineers' time.

Following the last cyberattack, the configuration of the equipment in question was lost, which slowed
down the neutralization of the attack.

Improvement suggestions
Proposal 1: Automate the tool update flow: Migration file -> SNOW -> LiveNX -> IPAM -> DNA ->
NNMi -> HPNA.

Prerequisites: AL approval + ServiceNow API -> NNMi

Proposal 2: Schedule the sending of a ServiceNow report by email, then use an ETL (to be
provided) either to perform an automated injection into NNMi or, failing that, at least to detect
deltas via an automated report.

Proposal 3: Use the FDL or another file to feed NNMi/HPNA. Study the feasibility of feeding the FDL
from SNOW (achievable via ETL).

Proposal 4: Dedicate two FTEs to updating the tools and to checking all the
tools after the changes.

Proposal 5: Propose new tools to replace the existing tools.
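
The ordered flow of Proposal 1 can be sketched as a simple chain that stops at the first failing tool and reports how far it got. This is an illustration under assumptions: `run_update_flow` and the `updaters` callables are hypothetical placeholders, and no real ServiceNow, LiveNX, IPAM, DNA Center, NNMi or HPNA API is called here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Order taken from Proposal 1 (after the migration file has been read).
TOOL_ORDER = ["SNOW", "LiveNX", "IPAM", "DNA", "NNMi", "HPNA"]


def run_update_flow(
    device: dict,
    updaters: Dict[str, Callable[[dict], None]],
) -> Tuple[List[str], Optional[str]]:
    """Update each tool in order; stop at the first failure.

    Returns the list of tools updated successfully and an error message
    (None when every tool was updated). `updaters` maps a tool name to a
    hypothetical function that pushes the device into that tool.
    """
    done: List[str] = []
    for tool in TOOL_ORDER:
        try:
            updaters[tool](device)
        except Exception as exc:  # any failure stops the chain
            return done, f"{tool}: {exc}"
        done.append(tool)
    return done, None


if __name__ == "__main__":
    def fail(_device: dict) -> None:
        raise RuntimeError("API down")

    updaters = {tool: (lambda _device: None) for tool in TOOL_ORDER}
    updaters["IPAM"] = fail
    print(run_update_flow({"name": "sw1"}, updaters))
```

Returning the partial progress makes it visible which tools still need a manual update, which is the gap the current manual process leaves open.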


HPNA
Context and problem
Once the integration is launched in NNMi, the equipment is discovered via SNMP and its status
comes up. NNMi then synchronizes with HPNA every hour (the minimum interval) to add the new
equipment to HPNA.

Sometimes the equipment is badly discovered, for several reasons: wrong driver, SNMP version, SSH not
configured, wrong login/password, missing data transfer protocol, etc.

Improvement suggestions
1- Raise engineers' awareness of the need to check HPNA one hour after integration.

2- Set up a tracking file for tool updates, filled in by the engineers and
checked by the RT+TL each week; if an update was forgotten, the engineer is notified to correct it.

3- Dedicate 2 hours of engineering time to updating the D-Day exchange tools.

4- Send a daily HPNA report on the equipment newly added to HPNA, in order to have a list
to check -> Action TL.

5- Automate the comparison of the daily HPNA report with the exchange file to detect
deltas.

6- Send weekly reports of failed snapshots so that the detected problems can be corrected.

7- Enforce the TDC on HPNA in order to improve troubleshooting.
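
The delta detection of suggestion 5 can be sketched as a set comparison between the device names of the exchange file and those of the daily HPNA report. The function name `inventory_delta` and the idea that both sources can be reduced to lists of hostnames are assumptions; the real file formats are not described in this report.

```python
from typing import Dict, Iterable, List


def inventory_delta(
    exchange_devices: Iterable[str],
    hpna_devices: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare two lists of device names, ignoring case and surrounding spaces."""
    expected = {name.strip().lower() for name in exchange_devices}
    present = {name.strip().lower() for name in hpna_devices}
    return {
        # In the exchange file but never added to HPNA.
        "missing_in_hpna": sorted(expected - present),
        # In HPNA but not announced in the exchange file.
        "unexpected_in_hpna": sorted(present - expected),
    }


if __name__ == "__main__":
    exchange = ["SW-FR238-01", "sw-fr238-02 "]
    hpna = ["sw-fr238-01", "sw-ma001-01"]
    print(inventory_delta(exchange, hpna))
```

An empty `missing_in_hpna` list would indicate that every device in the exchange file reached HPNA.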
ZTP
Context and problem:
Staging is done manually today, which increases the margin for human error.

Aruba ZTP solutions are used for the OT project but not for the RUN.

The Aruba Airwave ZTP solution is not compatible with the Aruba CX 6200/6300 Switch Series.
Therefore, the Airwave solution will be abandoned in favor of the Aruba Central solution, which is
compatible with CX6xxx and will be used for the ZTP of Aruba switches in the OT project.

Improvement suggestions

Use Aruba Central as the ZTP solution for staging Aruba switches for RUN changes.

Prerequisites: GIO endorsement + licenses for N2 and N3 engineers


Syslog Server
Context and problem:
The syslog server used today belongs to Air Liquide, but its MCO (maintenance in operational
condition) is not ensured: in the event of a problem on the server, syslog is no longer usable.

Benefits of using a syslog server:

Using a syslog server offers several benefits in managing and monitoring GIO's infrastructure and
applications. Syslog is a standard protocol used to send log messages and event data from various
sources within a network to a central server or repository. Here are some reasons why using a syslog
server can be advantageous:

• Centralized Logging: A syslog server centralizes log data from multiple devices,
systems, and applications. This makes it easier to manage and monitor logs, as we can access
all the information in one place rather than having to check individual devices.
• Efficient Troubleshooting: When issues or errors occur in the network or applications,
centralized logs can significantly speed up troubleshooting. We can quickly
correlate events across different devices to identify the root cause of problems.
• Compliance and Auditing: Many industries and regulations require organizations to maintain
and secure log data for auditing and compliance purposes. A syslog server helps meet these
requirements by storing logs in a secure and organized manner.
• Resource Optimization: By offloading log storage and management to a dedicated server,
we can reduce the impact on the resources of the production systems. This ensures that
applications and devices can operate efficiently without being burdened by excessive
logging activities.
• Historical Analysis: Syslog servers often provide features for searching, filtering, and
analyzing log data over time. This capability can be invaluable for identifying patterns, trends,
and anomalies, which helps in optimizing system performance and identifying potential
security threats.
• Security Monitoring: Centralized logging can enhance our ability to detect and respond to
security incidents. By aggregating logs from various sources, we can monitor for
unauthorized access attempts, unusual activity, and other signs of potential breaches.
• Alerting and Notifications: Syslog servers can be configured to send alerts and notifications
based on predefined conditions or triggers. This ensures that we are promptly informed of
critical events, enabling us to take timely action.
• Long-Term Storage: Syslog servers often provide options for long-term storage of log data.
This can be useful for historical analysis, compliance reporting, and forensic investigations.
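
As an illustration of the alerting benefit listed above, the sketch below decodes the priority header that starts a syslog message (PRI = facility x 8 + severity, as defined by the syslog standard) and flags messages at or above a chosen severity. The helper names and the "error" threshold are assumptions, not settings taken from the existing server.

```python
import re
from typing import Tuple

# Severity names in the order defined by the syslog standard (0 = most severe).
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def parse_pri(message: str) -> Tuple[int, str]:
    """Return (facility number, severity name) from the leading <PRI> header."""
    match = re.match(r"<(\d{1,3})>", message)
    if not match:
        raise ValueError("no PRI header found")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity]


def needs_alert(message: str, worst_allowed: str = "error") -> bool:
    """True when the message is at least as severe as `worst_allowed`."""
    _, severity = parse_pri(message)
    return SEVERITIES.index(severity) <= SEVERITIES.index(worst_allowed)


if __name__ == "__main__":
    sample = "<187>Aug  9 10:00:00 sw1 %LINK-3-UPDOWN: Interface down"
    print(parse_pri(sample), needs_alert(sample))
```

The sample message is invented for the example; 187 decodes to facility 23 (local7) with severity 3 (error).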

Improvement suggestions
• Ask GIO to ensure the MCO of the syslog server already in place.
• Installation and maintenance in operational condition of a syslog server by SFR teams.
• Installation and maintenance in operational condition of a syslog server by Intelcia teams.
DNA Center
Context and actions:

Cisco DNA Center provides the Wireless Assurance solution for reactive and proactive
monitoring and troubleshooting across your network.

DNA Center was deployed with the WIND project in early 2022. The L3s started using it around May
2022 for the audits on FR238, following the WIND migration and the problems encountered on that site. In
September 2022 we gave a presentation to GIO on the use of DNA Center, and in January 2023 we started
workshops on its advanced use. In early July 2023, a second presentation on advanced use was given
to GIO. In the same month, a second training session was given to both the L1 and L2
teams, and around mid-July the tool began to be used officially by all three skill levels
for in-depth troubleshooting of Wifi incidents.
Escalation from L1 to L2, then L3 - Process & Automation
Context and problem

Incident escalation follows a dedicated process that describes the specific cases to be
escalated to the next level. The L1 team receives or creates the incident, performs the primary
checks to understand the problem on the impacted site, and then escalates the ticket if it matches a case
defined in the escalation process.

The problem at this level is that the number of incidents in the network backlog is not stable and can
change within a few minutes. While an engineer is in a troubleshooting session for an hour, other tickets
remain in work-in-progress status, which can delay the escalation of a
configuration incident to the next level.

To avoid such situations and delays, which cause client dissatisfaction, we
suggest the following process and explain how it could be automated.

Improvement suggestions

Technical cases to be escalated automatically from the L1 to the L2 team, once detected:

• Slowness issues and disconnections due to overload
• SDWAN architecture: if the issue persists after troubleshooting with the ISP & NCR
• Faulty equipment: RMA
• Some cases related to Wifi, SMTP and other types of incidents where the procedure states
that the ticket should be escalated to L2 at a specific step
• A configuration change is necessary to solve the issue
• CPU incidents, after the first checks mentioned in the procedure
• All the checks have been performed by L1 but the issue is not solved

Cases related to GIO SLA (*according to hub delivery working hours and days):

GIO SLAs to Customer (BIS)
P1: 90% are resolved within 4 hours*
P2: 90% are resolved within 8 hours*
P3: 90% are resolved within 2 days*
P4: 90% are resolved within 5 days*
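
The 90% targets above can be checked with a short calculation. This is a sketch under assumptions: the function name `sla_compliance` is hypothetical, and resolution times are expected to be already counted in hub working time, in hours for P1/P2 and in days for P3/P4.

```python
from typing import Dict, List, Tuple

# Targets from the SLA list: hours for P1/P2, days for P3/P4.
SLA_TARGETS = {"P1": 4, "P2": 8, "P3": 2, "P4": 5}
SLA_RATE = 0.90  # share of tickets that must meet the target


def sla_compliance(
    resolution_times: Dict[str, List[float]],
) -> Dict[str, Tuple[float, bool]]:
    """Return, per priority, (share resolved within target, SLA met?)."""
    report = {}
    for priority, durations in resolution_times.items():
        if not durations:
            continue  # nothing to measure for this priority
        within = sum(1 for d in durations if d <= SLA_TARGETS[priority])
        rate = within / len(durations)
        report[priority] = (rate, rate >= SLA_RATE)
    return report


if __name__ == "__main__":
    # Invented sample data: 3 of 4 P1 tickets resolved within 4 hours.
    print(sla_compliance({"P1": [1, 2, 3, 5], "P3": [1] * 9 + [3]}))
```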
Ticket escalation depends on the priority. The exception is incidents where the issue is on the ISP or
local ISP side: in that case, the Incident Manager should review the incident handling, challenge the ISP and
involve GIO.

• P1: Follow the dedicated P1 process. The ticket can be escalated after 2 hours of treatment by L1.
• P2: The ticket is escalated to L2 after 4 hours of treatment by L1, then to L3 after 2 hours of
treatment by L2.
• P3: The ticket is escalated at the beginning of the 2nd day.
• P4: The ticket is escalated on the 4th day.

How to automate the escalation from L1 to L2, then from L2 to L3?

• Tickets whose short description contains « ISP Issue » should always stay in the L1 queue.
(Snow Action)
• Based on the SLA calculation in ServiceNow, the ticket should be automatically escalated to the
next level following the criteria defined above.
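
The escalation criteria can be sketched as a small decision function. Everything here is illustrative: the `Ticket` fields and `next_queue` are hypothetical names, not ServiceNow objects, the time spent at a level is assumed to be already counted in working hours, and converting "2nd day" and "4th day" into 24 and 72 hours is an assumption. Only thresholds stated in this report are encoded, so a ticket with no stated rule stays at its level.

```python
from dataclasses import dataclass

# (priority, current level) -> hours at that level before escalation.
ESCALATE_AFTER_HOURS = {
    ("P1", "L1"): 2,
    ("P2", "L1"): 4,
    ("P2", "L2"): 2,
    ("P3", "L1"): 24,  # assumption: beginning of the 2nd day
    ("P4", "L1"): 72,  # assumption: start of the 4th day
}
NEXT_LEVEL = {"L1": "L2", "L2": "L3"}


@dataclass
class Ticket:
    priority: str
    level: str
    hours_at_level: float
    short_description: str


def next_queue(ticket: Ticket) -> str:
    """Return the queue the ticket should be in according to the criteria."""
    # Tickets flagged as ISP issues always stay with L1.
    if "isp issue" in ticket.short_description.lower():
        return "L1"
    threshold = ESCALATE_AFTER_HOURS.get((ticket.priority, ticket.level))
    if threshold is not None and ticket.hours_at_level >= threshold:
        return NEXT_LEVEL[ticket.level]
    return ticket.level


if __name__ == "__main__":
    print(next_queue(Ticket("P2", "L1", 4.5, "Site unreachable")))
    print(next_queue(Ticket("P1", "L1", 3.0, "ISP Issue - link down")))
```

Keeping the thresholds in one table makes it easy to review them against the process document before turning them into ServiceNow rules.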
NNMi
Detection of recurrent events on the same site/item
The network devices are monitored by the HP NNMi tool. NNMi uses SNMP polling to obtain the
status of each network device, and alarms are triggered when an issue is detected for two
consecutive polling cycles. The polling cycle is the time it takes the tool to poll every monitored
device once. The polling cycle time depends on the number of devices being monitored and
is typically around 5 minutes. Alarms are typically generated within 15 minutes of the issue
occurring, depending on when the issue occurs within the current polling cycle.

Problem:
A site can experience disconnections lasting less than 5 minutes, but these can recur for a
day or more. As explained above, if the equipment becomes reachable again within 5 minutes, Netcool will
not generate an alarm and no incident will be created.
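
The two-consecutive-cycle rule, and the reason short repeated outages slip through it, can be illustrated with a minimal simulation. The function name `alarm_raised` and the list-of-booleans input are assumptions made for the example; this is not how NNMi or Netcool are implemented.

```python
from typing import Iterable


def alarm_raised(poll_results: Iterable[bool], consecutive_needed: int = 2) -> bool:
    """Return True if `consecutive_needed` polls in a row found the device down.

    `poll_results` holds one boolean per polling cycle (True = reachable).
    """
    failures = 0
    for reachable in poll_results:
        failures = 0 if reachable else failures + 1
        if failures >= consecutive_needed:
            return True
    return False


if __name__ == "__main__":
    # Device drops for less than one cycle, several times: no alarm.
    print(alarm_raised([True, False, True, False, True, False, True]))
    # Device stays down across two cycles: alarm.
    print(alarm_raised([True, False, False, True]))
```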

Improvement suggestions
A study is ongoing to generate a weekly report showing the repetitive events experienced by the same site
or equipment in NNMi. The operational meeting will then decide whether an incident ticket or a
problem ticket should be created, depending on the historical events and the incidents already created.
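
A weekly report of this kind could start from a simple count of events per site. This is a sketch only: `recurrent_sites`, the `(site, device)` event format and the threshold of 3 events are assumptions, since the study is still ongoing and the NNMi export format is not described here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def recurrent_sites(
    events: Iterable[Tuple[str, str]],
    min_events: int = 3,
) -> Dict[str, int]:
    """Count the week's events per site and keep sites at or above the threshold.

    `events` holds one (site, device) pair per event.
    """
    counts = Counter(site for site, _device in events)
    return {site: n for site, n in counts.items() if n >= min_events}


if __name__ == "__main__":
    # Invented sample week: three short events on one site, one on another.
    week = [("FR238", "sw1"), ("FR238", "sw1"), ("FR238", "ap7"), ("MA001", "sw2")]
    print(recurrent_sites(week))
```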
LAN Light/Light+
Context:
The LAN scope with the Light and Light+ supervision levels was subscribed with a specific
contractual commitment:

Incidents on this category of equipment, which is not supervised by SFR, must be created by the customer.
The network team then categorizes the ticket, checks the described issue and coordinates with all
the necessary third parties to solve it.

Problem:
We detected 51 switches with the LAN Light or LAN Light+ supervision level in the SFR monitoring tools.
After investigation, we found that the GIO delivery manager had sometimes asked for this
equipment to be integrated into all our tools during operational changes.

Based on 2022 data, the proactive incident rate related to LAN Light/Light+ equipment is 51%.

Improvement suggestions
Remind the client of the contractual commitment and that LAN Light and LAN Light+ equipment is not
covered by supervision. Confirm whether all LAN Light/Light+ equipment should be removed from NNMi
and Netcool.
ServiceDesk Issues
Context:
An end user facing network or connectivity issues can call or email the L1 network team, or
directly create a ticket in MMS and assign it to the network group.

The Service Desk can also receive complaints about breakdowns experienced by clients on specific
sites. After checks, it can assign the ticket to the Network support group, or create an incident task in
case of doubt.

Problem:
Approximately 4% of the incident tickets in the L1 Network backlog come from the Service Desk,
but a significant number of these issues are related neither to SFR-supervised equipment nor to
network problems.

Examples include incidents about application issues when the client is working from home instead of the
office, or incidents where only one user is impacted on a specific site. Other cases are old tickets
from the Service Desk queue assigned to the support team, which also affects the global SLA, and new
tickets in ServiceNow with incorrect information fields (location, problem description, etc.).

Improvement suggestions
To get incident tickets assigned to the correct group and solve all issues under the different
operational commitments, we will share a dedicated template to be used by the Service Desk before
assigning incident tickets. It will contain a few checks to perform with the client in order to
confirm or rule out a network issue. Once a network issue is confirmed, the ticket should be assigned to the
L1 network support team. If there is any doubt even after using the template, an incident task should be
created with the L1 network team, which will investigate in depth and provide quick feedback.
