Akash Teja
SUMMARY:
- 9+ years of experience in the Information Technology industry, including 5+ years as a Hadoop/Spark Developer working with Big Data technologies across the Hadoop and Spark ecosystems, and 3+ years with Java/J2EE technologies and SQL.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, MapReduce, Hive, Pig, YARN, Sqoop, Flume, HBase, Impala, Oozie, ZooKeeper, Kafka, and Spark.
- In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, NameNode, DataNode, and MRv1/MRv2 concepts.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Hands-on experience in the Analysis, Design, Coding, and Testing phases of the Software Development Life Cycle (SDLC).
- Hands-on experience with AWS (Amazon Web Services), including Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
- Worked extensively with AWS cloud services such as EC2, S3, and EBS.
- Migrated an existing on-premises application to AWS; used EC2 and S3 for small data set processing and storage, and maintained Hadoop clusters on AWS EMR.
- Hands-on experience across Big Data application phases, including data ingestion, data analytics, and data visualization.
- Experience with Hadoop distributions such as Cloudera, Hortonworks, and Amazon EMR.
- Experience transferring data from RDBMS to HDFS and Hive tables using Sqoop.
- Migrated code from Hive to Apache Spark and Scala using Spark SQL and RDDs.
- Experience working with Flume to load log data from multiple sources directly into HDFS.
- Well versed in workflow scheduling and monitoring tools such as Oozie, Hue, and ZooKeeper.
- Good knowledge of Impala, Mahout, Spark SQL, Storm, Avro, Kafka, Hue, and AWS; familiar with IDE and build tools such as Eclipse, NetBeans, and Maven.
- Installed and configured MapReduce, Hive, and HDFS; implemented CDH5 and HDP clusters on CentOS and assisted with performance tuning, monitoring, and troubleshooting.
- Experience in data processing: collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Proficient in data manipulation and analysis using Pandas, a Python library for data handling and transformation.
- Experience working with Pandas data structures, including Series and DataFrame, for efficient data organization and analysis.
- Strong knowledge of version control systems such as SVN and GitHub.
- Experience processing streaming data on clusters using Kafka and Spark Streaming.
- Experience analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Basic knowledge of Kudu, NiFi, Kylin, and Zeppelin with Apache Spark.
- Experience with NoSQL column-oriented databases such as HBase and Cassandra and their integration with the Hadoop cluster.
- Involved in cluster coordination services through ZooKeeper.
- Good level of experience in Core Java and J2EE technologies such as JDBC, Servlets, and JSP.
- Experienced with Robotic Process Automation (RPA) platforms such as UiPath, Automation Anywhere, and Blue Prism to automate repetitive tasks and increase operational efficiency.
- Hands-on experience automating processes by interacting with various systems and applications using UiPath, including web scraping, data extraction, and report generation.
                                               TECHNICAL SKILLS:
Programming Languages: C, C++, Java 1.4/1.5/1.6/1.7/1.8, SQL, PL/SQL, Python
Big Data Technologies: Apache Hadoop, HDFS, Spark, Hive, Pig, Crunch, HBase, Sqoop, Oozie, Zookeeper, Mahout
Web Technologies: HTML, HTML5, XML, XHTML, CSS3, JSON, AJAX, XSD, WSDL, ExtJS
RDBMS/Databases: Oracle 8i/9i/10g, MySQL, PostgreSQL, SQL Server 6.5, MS Access, MongoDB (NoSQL)
Server-Side Frameworks and Libraries: Spring 2.5/3.0/3.2, Hibernate 3.x/4.x, MyBatis, Spring MVC, Spring Web Flow, Spring Batch, Spring Integration, Spring-WS, Struts, Jersey RESTful web services, XFire, Apache CXF, Mule ESB, Zookeeper, Curator, Apache POI, JUnit, Mockito, PowerMock, SLF4J, Log4j, Gson, Jackson, UML, Selenium, Crystal Reports
UI Frameworks and Libraries: ExtJS, jQuery, jQuery UI, AngularJS, Thymeleaf, PrimeFaces, Bootstrap
Application Servers: BEA WebLogic, IBM WebSphere, Apache Tomcat
Build Tools and IDEs: Maven, Ant, IntelliJ, Eclipse, Spring Tool Suite, NetBeans, Jenkins
Operating Systems: Windows, UNIX, Sun Solaris, Linux, Mac OS X
Tools: SVN, JIRA, Toad, SQL Developer, Serena Dimensions, SharePoint, ClearCase, Perforce
Process & Concepts: Agile, SCRUM, SDLC, Object-Oriented Analysis and Design, Test-Driven Development, Continuous Integration
Education Details:
Master's in Information Technology and Management, Campbellsville University
Bachelor's in Computer Science, Jaipur National University
PROFESSIONAL EXPERIENCE:
Walmart, Sunnyvale, CA                                        October 2021 to July 2023
Sr. Data Engineer
Responsibilities:
- Extracted and analyzed data from varied sources with Spark; created and refined algorithms to keep the system running smoothly and to filter out invalid data.
- Designed, created, tested, and maintained complete data management and processing systems on Airflow.
- Constructed intricate, fully automated pipelines; built JARs, tested, and deployed using Git-based CI/CD.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Strengthened unit tests to minimize issues and deliver a quality product; wrote unit tests that compare the expected and actual behavior of Python functions.
- Implemented Spark using Scala and Java, utilizing DataFrames and the Spark SQL API for faster data processing.
- Built data processing pipelines and ETL (Extract, Transform, Load) workflows using Java and related frameworks.
- Created utilities in Scala to automate manual work; reused operators to reduce redundancy.
- Developed data ingestion processes by implementing custom Java applications to extract data from various sources such as databases, APIs, and file systems.
- Set up and managed Dataproc clusters to process large-scale data workloads efficiently, including configuring cluster specifications, scaling resources based on workload demands, and optimizing cluster performance for faster data processing.
- Structured highly scalable, robust, and fault-tolerant systems.
- Identified data acquisition opportunities and methods to derive value from existing data.
- Utilized Java libraries for data serialization formats (e.g., Avro, Parquet) and worked with schema evolution to ensure compatibility and flexibility in data storage.
- Implemented performance optimizations, including indexing, caching, and query tuning, to enhance data retrieval and processing efficiency in Java applications.
- Familiar with Java frameworks for web development (e.g., Spring, Java Servlets) and RESTful API design, enabling seamless integration of data services with other systems.
- Created multi-node Hadoop and Spark clusters on cloud instances to generate terabytes of data and stored it in HDFS on GCP.
- Configured Spark Streaming to receive data from Kafka and store it in HDFS using Scala.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which receives data from Kafka in near real time.
- Developed a Kafka consumer using Spark Structured Streaming to read from Kafka topics and write to GCP (see the sketch at the end of this section).
- Refined the data quality, reliability, and efficiency of individual components and of the complete system.
- Built complete solutions by integrating a variety of programming languages and tools.
- Developed interactive dashboards using Kibana and Looker for data visualization, analytics, and reporting purposes.
- Introduced new data management tools and technologies into the existing system to make it more effective.
- Developed dashboards and reports using Looker's data modeling and visualization functionality, including LookML models, dimensions, measures, and custom visualizations.
Environment: HDFS, Spark, Hive, Sqoop, SQL, HBase, Scala, Python, GCP, Kafka, Airflow, Shell Scripting, Looker, Kibana.
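The sketch below illustrates the Kafka-to-storage streaming pattern described above. It is written in PySpark purely for illustration (the production jobs used Scala with Spark Structured Streaming); the broker addresses, topic name, event schema, and output paths are hypothetical placeholders, not the actual configuration.

# Illustrative PySpark sketch: Kafka -> Spark Structured Streaming -> Parquet on HDFS/GCS.
# Requires the spark-sql-kafka connector on the classpath; all names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-learner-events-sketch").getOrCreate()

# Assumed event schema; the real learner data model differs.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
       .option("subscribe", "learner-events")               # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write the parsed stream as Parquet; paths are placeholders for HDFS/GCS locations.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner_events")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_events")
         .outputMode("append")
         .start())

query.awaitTermination()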
Comcast, Philadelphia, PA                                        June 2021 to October 2021
Sr. Big Data Developer/Engineer
Responsibilities:
- Worked as a senior Big Data cloud engineer for one of the largest cable and broadband suppliers in the US.
- Led several improvement projects using Databricks, Apache Spark Streaming, and AWS services such as Kinesis, DynamoDB, and Elasticsearch.
- Developed and deployed data processing pipelines on Amazon Web Services (AWS) using Python and Spark Streaming.
- Implemented data processing and analysis workflows with AWS Glue, Amazon EMR, and Apache Spark in Python, handling large-scale datasets and performing complex transformations.
- Followed best practices in code organization, documentation, and version control using Git, resulting in maintainable and scalable Flask and FastAPI applications.
- Implemented data transformations and business logic using Java libraries and frameworks, ensuring data integrity, quality, and compliance with business requirements.
- Utilized Spark Streaming's windowing and sliding-window operations to handle time-based aggregations and analytics on streaming data.
- Conducted performance tuning and optimization of Spark Streaming jobs to improve throughput, reduce latency, and enhance overall system efficiency.
- Developed several highly complex Databricks jobs using Spark Streaming (PySpark and Spark SQL) to process real-time data from AWS Kinesis, storing the final output in S3 buckets, DynamoDB, and AWS Elasticsearch.
- Developed a testing framework, DQ Checks, using PySpark; the framework validates real-time data arriving from SFTP or AWS Kinesis (see the sketch at the end of this section).
- Built a CI/CD pipeline on Jenkins from the GitHub repository to manage release deployments.
- The framework also emails results and stores the final output in an Athena table for further analysis.
Environment: Amazon EC2, Spark, Python, AWS SDK for Python, Spark Streaming, HBase, Zookeeper, MapReduce, Postman, Flume, NoSQL, HDFS, Avro, Airflow, Sqoop, DynamoDB.
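The sketch below gives a simplified picture of the kind of validation the DQ Checks framework performs, written in PySpark against a batch extract for illustration; the input path, key column, and required columns are hypothetical, and the real framework reads from Kinesis/SFTP and publishes results to Athena and email.

# Simplified PySpark sketch of a data-quality (DQ) check in the spirit of the DQ Checks
# framework described above. Column names, thresholds, and the input path are hypothetical.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()

def run_dq_checks(df: DataFrame, key_col: str, required_cols: list) -> dict:
    """Return basic DQ metrics: row count, null counts per required column, duplicate keys."""
    total = df.count()
    null_counts = {c: df.filter(F.col(c).isNull()).count() for c in required_cols}
    duplicate_keys = (df.groupBy(key_col).count()
                        .filter(F.col("count") > 1)
                        .count())
    return {"rows": total, "nulls": null_counts, "duplicate_keys": duplicate_keys}

# Example usage against a placeholder S3 path.
df = spark.read.json("s3://example-bucket/incoming/")
report = run_dq_checks(df, key_col="account_id", required_cols=["account_id", "event_ts"])
print(report)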
Lowe's, Mooresville, NC                                      Oct 2018 to April 2021
Sr. Big Data Developer/Engineer
Responsibilities:
- Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
- Worked on all activities related to the development, implementation, and support of Hadoop.
- Designed custom reusable templates in NiFi for code reusability and interoperability.
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
- Built frameworks in Python on Airflow to orchestrate the data science pipelines (see the sketch at the end of this section).
- Worked with teams to set up AWS EC2 instances using AWS services such as S3, EBS, Elastic Load Balancing, Auto Scaling groups, VPC subnets, and CloudWatch.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and the loading of structured and unstructured data.
- Worked with Kafka to load data into HDFS and exported it to a MongoDB database.
- Created partitions and buckets based on state for further processing with bucketed Hive joins.
- Installed and configured Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, Zookeeper, and Sqoop.
- Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
- Wrote complex Hive queries and UDFs in Java and Python.
- Worked on AWS, provisioning EC2 infrastructure and deploying applications behind Elastic Load Balancing.
- Generated data analysis reports using Matplotlib and Tableau, and delivered and presented the results to C-level decision makers.
- Worked with the Hadoop ecosystem covering HDFS, HBase, YARN, and MapReduce.
- Used Scala and Spark SQL to develop Spark code for faster processing and testing, and performed complex Hive queries on Hive tables.
- Worked on Kerberization to secure the applications, using SSL and SAML authentication.
- Wrote and executed SQL queries to work with structured data in relational databases and to validate the transformation/business logic.
- Used Flume to move data from individual data sources into the Hadoop system.
- Used the MRUnit framework to test MapReduce code.
- Responsible for building scalable, distributed data solutions using the Hadoop ecosystem and Spark.
- Worked on performance testing of APIs using Postman.
- Involved in data acquisition and pre-processing of various types of source data using StreamSets.
- Responsible for the design and development of Spark SQL scripts in Scala/Java based on functional specifications.
- Analyzed the data by running Hive queries (HiveQL), Pig scripts, Spark SQL, and Spark Streaming.
- Developed tools using Python, shell scripting, and XML to automate routine tasks.
- Wrote scripts in Python to extract data from HTML files.
- Implemented MapReduce jobs in Hive by querying the available data.
- Configured the Hive Metastore with MySQL, which stores the metadata for Hive tables.
- Performed data analytics in Hive and exported the resulting metrics back to an Oracle database using Sqoop.
- Performed performance tuning of Hive queries and MapReduce programs for different applications.
- Proactively involved in ongoing maintenance, support, and improvements of the Hadoop cluster.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Used Cloudera Manager for installation and management of the Hadoop cluster.
Environment: NiFi 1.1, Hadoop 2.6, JSON, XML, Avro, HDFS, Airflow, Teradata R15, Sqoop, Kafka, MongoDB, Hive 2.3, Pig 0.17, HBase, Zookeeper, MapReduce, Postman, Java, Python 3.6, YARN, Flume, NoSQL, Cassandra 3.11.
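The sketch below shows the general shape of an Airflow orchestration framework like the one described above, using Airflow 2-style imports; the DAG id, schedule, and task callables are hypothetical placeholders rather than the actual pipeline.

# Minimal Airflow DAG sketch for orchestrating a data-science pipeline.
# Task names and callables are placeholders standing in for real pipeline steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from source systems")      # placeholder step

def transform():
    print("clean and feature-engineer the data")    # placeholder step

def load():
    print("publish curated data for modeling")      # placeholder step

with DAG(
    dag_id="ds_pipeline_sketch",                    # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency chain: extract -> transform -> load.
    t1 >> t2 >> t3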
US Bank, Minneapolis, MN                                                        Dec 2017 to Oct 2018
Hadoop Developer
Responsibilities:
- In-depth understanding and knowledge of Hadoop architecture and components such as HDFS, Application Master, Node Manager, Resource Manager, NameNode, DataNode, and MapReduce concepts.
- Imported required tables from RDBMS into HDFS using Sqoop, and used Storm and Kafka for real-time streaming of data into HBase.
- Good experience with the NoSQL database HBase; created HBase tables to load large sets of semi-structured data coming from various sources.
- Wrote Hive and Pig scripts as ETL tools to perform transformations, event joins, traffic filtering, and pre-aggregations before storing the data in HDFS.
- Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Developed Spark code using Scala and Spark SQL for faster testing and processing of data.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Developed Java code to generate, compare, and merge Avro schema files.
- Prepared validation report queries, executed them after every ETL run, and shared the resulting values with business users in different phases of the project.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting; applied Hive optimization techniques during joins and best practices for writing Hive scripts in HiveQL.
- Imported and exported data into HDFS and Hive using Sqoop, and wrote Hive queries to extract the processed data.
- Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports per user needs.
- Teamed up with architects to design a Spark model for the existing MapReduce model, and migrated MapReduce models to Spark models using Scala.
- Implemented Spark using Scala, utilizing the Spark Core, Spark Streaming, and Spark SQL APIs for faster processing of data instead of MapReduce in Java.
- Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables; handled structured data using Spark SQL (see the sketch at the end of this section).
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Pig, Sqoop, Spark, and Zookeeper.
- Expert knowledge of MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
Environment: Apache Hadoop, HDFS, MapReduce, HBase, Hive, YARN, Pig, Sqoop, Flume, Zookeeper, Kafka, Impala, Spark SQL, Spark Core, Spark Streaming, NoSQL, MySQL, Cloudera, Java, JDBC, Spring, ETL, WebLogic, Web Analytics, Avro, Cassandra, Oracle, Shell Scripting, Ubuntu.
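The sketch below illustrates the JSON-to-Hive loading pattern described above, written in PySpark for illustration (the original code used Scala and SchemaRDDs, the predecessor of today's DataFrames); the input path, database, and table names are hypothetical.

# Illustrative PySpark sketch: load JSON, query it with Spark SQL, persist to Hive tables.
# Paths and table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive-sketch")
         .enableHiveSupport()           # needed so saveAsTable writes to the Hive metastore
         .getOrCreate())

# Spark infers the schema from the JSON records; event_time is an assumed field.
events = spark.read.json("hdfs:///landing/events/")

# Register and query the data with Spark SQL before persisting.
events.createOrReplaceTempView("events_raw")
daily = spark.sql("""
    SELECT to_date(event_time) AS event_date, COUNT(*) AS event_count
    FROM events_raw
    GROUP BY to_date(event_time)
""")

# Persist both the raw and the aggregated data as Hive tables (placeholder names).
events.write.mode("overwrite").saveAsTable("analytics.events_raw")
daily.write.mode("overwrite").saveAsTable("analytics.events_daily")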
Zenmonics, Hyderabad, India                                                       May 2012 to Nov 2015
Data Analyst
Responsibilities:
- Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python (see the sketch at the end of this section).
- Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to extract and merge large volumes of historical data stored in Oracle 11g, validating the ETL-processed data in the target database.
- Good understanding of Teradata SQL Assistant, Teradata Administrator, and data loading; experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables, and OLAP reporting.
- Identified the variables that significantly affect the target.
- Continuously collected business requirements during the whole project life cycle.
- Conducted model optimization and comparison using stepwise selection based on AIC values.
- Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
- Generated data analysis reports using Matplotlib and Tableau, and delivered and presented the results to C-level decision makers.
- Generated a cost-benefit analysis to quantify the impact of the model implementation compared with the prior approach.
- Worked on model selection based on confusion matrices, minimizing Type II error.
Environment: Tableau 7, Python 2.6.8, NumPy, Pandas, Matplotlib, scikit-learn, MongoDB, Oracle 10g, SQL
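The sketch below shows the kind of cleaning, reshaping, and segmentation workflow described above, using Pandas and NumPy; the input file, column names, and spend thresholds are hypothetical placeholders.

# Small Pandas/NumPy sketch of cleaning, reshaping, and segmenting a dataset.
# File name, columns, and bins are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("transactions.csv")              # placeholder input file

# Basic cleaning: drop exact duplicates and rows missing key fields.
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id", "amount"])

# Reshape: one row per customer with total spend per month.
df["month"] = pd.to_datetime(df["txn_date"]).dt.to_period("M")
monthly = df.pivot_table(index="customer_id", columns="month",
                         values="amount", aggfunc="sum", fill_value=0)

# Segment customers by total spend using NumPy-defined bins.
totals = monthly.sum(axis=1)
bins = np.array([0, 100, 1000, np.inf])
segments = pd.cut(totals, bins=bins, labels=["low", "mid", "high"])
print(segments.value_counts())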