Chandralekha Rao Yachamaneni
E-mail: chandralekhar264@gmail.com
Mobile: +1 (469)640-0633
https://www.linkedin.com/in/chandralekha-r-5239b9239/
PROFESSIONAL SUMMARY
● 8+ years of professional experience in Information Technology, including around 5 years of expertise in Big Data using the Hadoop framework, covering analysis, design, development, testing, documentation, deployment, and integration with SQL and Big Data technologies.
● Expertise in major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, ZooKeeper, and Hue.
● Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
● Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions
and Data Warehouse tools for reporting and data analysis.
● Deployed Big Data Hadoop applications with Talend and Python ETL on the cloud (Amazon Web Services) and on Microsoft Azure.
● Expertise in deploying cloud-based services with Amazon Web Services (Databases, Migration, Compute, IAM,
Storage, Analytics, Network & Content Delivery, Lambda and Application Integration).
● Excellent proficiency in storage, compute, and networking services, with implementation experience in data engineering using key AWS services such as VPC, EC2, S3, ELB, Auto Scaling Groups (ASG), EBS, RDS, IAM, EFS, CloudFormation, Redshift, DynamoDB, Glue, Lambda, Step Functions, Kinesis, Route 53, CloudWatch, CloudFront, CloudTrail, SQS, SNS, SES, AWS Systems Manager, etc.
● Developed data ingestion modules using AWS Step Functions, AWS Glue and Python modules.
● Developed dataset processes for data modelling and data mining; recommended ways to improve data reliability, efficiency, and quality.
● Created multi-threaded Java applications running on the master node to pull additional data feeds (JSON and XML) into S3.
● Handled streaming data in real time with Kafka, Flume, and Spark Streaming; working knowledge of Flink.
● Developed and deployed various Lambda functions in AWS using built-in AWS Lambda libraries, and deployed Lambda functions in Scala with custom libraries.
● Optimized Hive tables using partitions and bucketing to improve the performance of HiveQL queries.
● Used Spark SQL to read data from Hive tables and perform data cleansing, validations, transformations, and aggregations per downstream business team requirements (a minimal PySpark sketch of this pattern follows this summary).
● Responsible for loading processed data into AWS Redshift tables to allow the business reporting team to build dashboards.
● Created views in AWS Athena to allow secure and streamlined data analysis access to downstream business
teams.
● Worked extensively with the Data Science team to help productionize machine learning models and to build various feature datasets as needed for data analysis and modelling.
● Experience implementing AWS Kinesis Firehose to sink data directly to Redshift and S3.
● Experience converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis) and deploying it with AWS CloudFormation.
● Hands-on experience working with AWS services such as Lambda, Athena, DynamoDB, Step Functions, SNS, SQS, S3, and IAM.
● Worked extensively on automating the launch of EMR clusters and terminating them as soon as jobs finished.
● Excellent understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks workspaces on AWS and Microsoft Azure for business analytics and managed Databricks clusters for the machine learning lifecycle.
● Hands-on experience with Python and PySpark implementations in AWS EMR, building data pipeline infrastructure to support deployments of machine learning models, data analysis, and cleansing; built statistical models with extensive use of Python, Pandas, NumPy, visualization with Matplotlib and Seaborn, and scikit-learn and XGBoost for predictions.
● Worked on data visualization and analytics with research scientists and business stakeholders.
● Experience in complete project life cycle (design, development, testing and implementation) of Client Server
and Web applications.
● Involved in technical analysis to identify use cases for SageMaker.
● Excellent programming skills with experience in Java, C, SQL, and Python Programming.
● Worked with various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
● Experienced in working in SDLC, Agile and Waterfall Methodologies.
● Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP,
Struts, Spring, Hibernate and Web services.
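Sample code (illustrative): a minimal PySpark sketch of the Spark SQL and Hive partitioning/bucketing pattern referenced in the summary above. Database, table, and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark session with Hive support so Spark SQL can read and write Hive tables.
spark = (SparkSession.builder
         .appName("hive-curation-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read an existing Hive table with Spark SQL.
raw = spark.sql("SELECT * FROM staging.claims_raw")

# Basic cleansing and validation before handing the data to downstream teams.
curated = (raw
           .dropDuplicates(["claim_id"])
           .filter(F.col("claim_amount") >= 0)
           .withColumn("claim_dt", F.to_date("claim_ts")))

# Write back as a partitioned, bucketed Hive table to speed up HiveQL queries.
(curated.write
 .mode("overwrite")
 .partitionBy("claim_dt")
 .bucketBy(32, "member_id")
 .sortBy("member_id")
 .saveAsTable("curated.claims_curated"))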
EDUCATION
Bachelor of Engineering.
TOOLS AND TECHNOLOGIES
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Flink, YARN, Oozie, ZooKeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift, Glue, Athena, SageMaker
Orchestration Tools: Oozie, Airflow
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008, 2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: Windows (all versions), UNIX, Linux, macOS, Sun Solaris
Client: Aetna, Hartford, CT Oct 2020 to Current
Role: Sr Data Engineer/Java
Responsibilities:
● Evaluated client needs and translated business requirements into functional specifications, onboarding clients onto the Hadoop ecosystem.
● Extracted and updated the data into HDFS using Sqoop import and export.
● Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
● Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
● Constructed AWS data pipelines using VPC, EC2, S3, Auto Scaling Groups (ASG), EBS, Snowflake, IAM, CloudFormation, Route 53, CloudWatch, CloudFront, and CloudTrail.
● Implemented a generalized solution model using AWS SageMaker.
● Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize the raw data.
● Involved in writing Flink jobs to parse near-real-time data and push it to Hive.
● Used Lambda functions and Step Functions to trigger Glue jobs and orchestrate the data pipeline (a minimal sketch appears at the end of this section).
● Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
● Created various Hive external tables and staging tables and joined them as per the requirements; implemented static partitioning, dynamic partitioning, and bucketing.
● Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
● Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data
coming from various sources.
● Used Spark DataFrame operations to perform required validations on the data and to run analytics on the Hive data.
● Implemented a full CI/CD pipeline by integrating SCM (Git) with the automated build tool Gradle, deploying with Jenkins (declarative pipelines) and Dockerized containers in production, and worked with DevOps tools such as Ansible, Chef, AWS CloudFormation, AWS CodePipeline, Terraform, and Kubernetes.
● Responsible for managing data coming from different sources through Kafka.
● Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
● Used Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
● Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
● Created Databricks notebooks using SQL and Python and automated the notebooks with jobs.
● Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
● Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
● Responsible for handling streaming data from web server console logs.
● Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
● Used Sqoop to import data into HDFS and Hive from other data systems.
● Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
● Involved in NoSQL database design, integration, and implementation.
● Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along with
components on HDFS, Hive.
● Applied various machine learning algorithms and statistical modeling like decision tree, logistic regression,
Gradient Boosting Machine to build predictive model using scikit-learn package in Python.
● Strong implementation experience with object-oriented concepts, multithreading, and Java/Scala.
● Experience developing Airflow workflows for scheduling and orchestrating the ETL process.
● Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
● Migrated MapReduce jobs to Spark jobs to achieve better performance.
● Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
Environment: Hadoop (HDFS, MapReduce), AWS Services (Lambda, EMR, Autoscaling, Glue, S3, Redshift, Athena),
Scala, Databricks, Yarn, IAM, Flink, PostgreSQL, Spark, Impala, Azure, Hive, Mongo DB, Pig, DevOps, HBase, Oozie, Hue,
Sqoop, Flume, Oracle, NIFI, Git.
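Sample code (illustrative): a minimal sketch of the Lambda-triggered serverless pipeline described in this section, where an S3 event starts a Glue job whose output lands in the Glue Catalog and is queryable from Athena. The Glue job name and argument key are hypothetical.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start a Glue job run for each object referenced in the S3 event."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the newly arrived object to the Glue job as a job argument.
        response = glue.start_job_run(
            JobName="refine-claims-job",  # hypothetical Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        runs.append(response["JobRunId"])
    return {"started_job_runs": runs}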
Client: Neilson Dairy, Oldsmar, FL Sep 2019 to Oct 2020
Role: Hadoop developer/Java
Responsibilities:
● Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS through processing and analysis of the data in HDFS.
● Configured Flume to extract the data from the web server output files to load into HDFS.
● Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile
and network devices and pushed into HDFS.
● Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
● Created external tables with partitions using Hive, AWS Athena, and Redshift.
● Developed Spark Structured Streaming jobs to read data from Kafka in real-time and batch modes, apply different modes of change data capture (CDC), and load the data into Hive (a minimal sketch appears at the end of this section).
● Built the development environment integrating S3, EC2, Glue, Athena, AWS Data Pipeline, Kinesis Streams, Firehose, Lambda, Redshift, RDS, and DynamoDB; created a React client web app backed by serverless AWS Lambda functions to interact with an AWS SageMaker endpoint.
● Migrated an in-house database to the AWS cloud and designed, built, and deployed a multitude of applications utilizing the AWS stack (including EC2 and RDS), focusing on high availability and auto-scaling.
● Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
● Creating Spark clusters and configuring high concurrency clusters using Databricks to speed up the preparation
of high-quality data.
● Created workflow using Step functions to orchestrate Glue jobs and its dependencies.
● Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
● Wrote to the Glue metadata catalog, which in turn enables querying the refined data from Athena, achieving a serverless querying environment.
● Worked on an enterprise messaging bus with the Kafka-TIBCO connector; published queues were abstracted using Spark DStreams, and XML and JSON data was parsed in Hive.
● Designed and configured a Kafka cluster to accommodate a heavy throughput of 1 million messages per second; used Kafka producer 0.6.3 APIs to produce messages.
● Implemented a generalized solution model using AWS SageMaker.
● Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
● Used Kafka and Kafka brokers to initiate the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
● Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
● Worked with various HDFS file formats like Avro, SequenceFile, and JSON, and various compression formats like Snappy and bzip2.
● Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
● Processed data from Kafka topics and displayed the real-time streams in dashboards.
● Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
● Analyzed the SQL scripts and designed the solution to implement using Scala.
● Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data with Spark SQL.
● Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
● Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using the Spark context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
● Set up data pipelines using TDCH, Talend, Sqoop, and PySpark based on the size of the data loads.
● Implemented real-time analytics on Cassandra data using the Thrift API.
● Designed column families in Cassandra, ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
● Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).
● Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
● Developed workflows in Oozie to automate the tasks of loading data into NiFi and pre-processing with Pig.
● Worked on Apache NiFi to decompress and move JSON files from local storage to HDFS.
● Experience moving raw data between different systems using Apache NiFi.
● Used Elasticsearch for indexing/full text searching.
● Coded and developed a custom Elasticsearch Java-based wrapper client using the Jest API.
Environment: Hadoop (HDFS, MapReduce), Databricks, Spark, AWS services (EC2, S3, Glue, Athena, EMR, Redshift), Talend, Impala, Hive, PostgreSQL, Flink, Jenkins, NiFi, Scala, MongoDB, Cassandra, Python, Pig, Sqoop, Hibernate, Spring, Oozie, Auto Scaling, Azure, Elasticsearch, DynamoDB, UNIX Shell Scripting, Tez, Google Analytics.
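Sample code (illustrative): a minimal PySpark Structured Streaming sketch of the Kafka-to-Hive flow described in this section. The broker address, topic, schema, and paths are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-structured-streaming-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Expected shape of the JSON events on the topic (hypothetical schema).
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Read the Kafka topic as a streaming source and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "orders")
          .option("startingOffsets", "latest")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Land the parsed records as Parquet files that back an external Hive table.
query = (events.writeStream
         .format("parquet")
         .option("path", "/warehouse/orders_stream")
         .option("checkpointLocation", "/checkpoints/orders_stream")
         .outputMode("append")
         .start())
query.awaitTermination()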
Client: Directv, El Segundo, CA Oct 2018 to Aug 2019
Role: Data Engineer/Java
Responsibilities:
● Designed end to end scalable architecture to solve business problems using various Azure Components like
HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
● Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the
SQL Activity.
● Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation
from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage
patterns.
● Wrote multiple Hive UDFs using core Java and OOP concepts, and Spark functions within Python programs.
● Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
● Managed a hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
● Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
● Used Azure Event Grid, a managed event service that makes it easy to manage events across many different Azure services and applications.
● Used Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
● Used Delta Lake for scalable metadata handling and for unifying streaming and batch workloads.
● Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
● Leveraged Delta Lake merge, update, and delete operations to enable complex use cases (a minimal sketch appears at the end of this section).
● Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
● Used Databricks to integrate easily with the whole Microsoft stack.
● Wrote Spark SQL and PySpark scripts in the Databricks environment to validate the monthly account-level customer data.
● Creating Spark clusters and configuring high concurrency clusters using Azure Databricks (ADB) to speed up
the preparation of high-quality data.
● Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
● Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
● Used Azure Data Catalog, which helps organize data assets and get more value from existing investments.
● Used Azure Synapse to bring these workloads together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
● Utilized clinical data to generate features describing different illnesses using LDA topic modelling.
● Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
● Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
● Created Session Beans and controller Servlets for handling HTTP requests from Talend.
● Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
● Wrote documentation for each report including purpose, data source, column mapping, transformation, and
user group.
● Utilized Waterfall methodology for team and project management.
● Used Git for version control with Data Engineer team and Data Scientists colleagues.
Environment: Ubuntu 16.04, Hadoop 2.0, Spark (PySpark, Spark Streaming, Spark SQL, Spark MLlib), NiFi, Jenkins, Pig 0.15, Python 3.x (NLTK, Pandas), Tableau 10.3, GitHub, Azure (Storage, DW, ADF, ADLS, Databricks), AWS Redshift, and OpenCV.
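Sample code (illustrative): a minimal PySpark sketch of the Delta Lake merge (upsert) pattern described in this section, as it might run in an Azure Databricks notebook. The table paths and key column are hypothetical, and spark is the session Databricks provides.

from delta.tables import DeltaTable

# Monthly account-level updates arriving from upstream (hypothetical source path).
updates = spark.read.parquet("/mnt/raw/accounts_monthly")

target = DeltaTable.forPath(spark, "/mnt/delta/accounts")

# Merge handles inserts and updates in one pass; deletes and time travel
# ("VERSION AS OF") remain available on the same table for audits and rollbacks.
(target.alias("t")
 .merge(updates.alias("s"), "t.account_id = s.account_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())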
Client: BERKADIA, Hyderabad, India Aug 2014 to Sep 2018
Role: Data Analyst
Responsibilities:
● Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify
workable items for further development.
● Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes, using Pig.
● Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
● Deployed Machine Learning models for item-item similarity on Amazon SageMaker (AWS).
● Deployed services on AWS and utilized Step Functions to trigger the data pipelines.
● Collected and aggregated large amounts of log data using Apache Flume, staging the data in HDFS for further analysis.
● Created catalog database tables for Athena to enable fast querying of S3 data, tuning for query performance in Athena without moving data into Redshift.
● Used AWS SageMaker to quickly build, train, and deploy machine learning models.
● Created plugins to extract data from multiple sources like Apache Kafka, Database and Messaging Queues.
● Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
● Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.
● Experienced in setting up Multi-hop, Fan-in, and Fan-out workflow in Flume.
● Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
● Implemented custom serializers and interceptors in Flume to mask confidential data and filter unwanted records from the event payload.
● Configured, designed, implemented, and monitored Kafka cluster and connectors.
● Responsible for ingesting large volumes of IoT data into Kafka.
● Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a minimal sketch appears at the end of this section).
● Worked with teams to use KSQL for real-time analytics.
● Worked with multiplexing, replicating and consolidation in Flume.
● Used Oozie operational services for batch processing and scheduling workflows dynamically.
Environment: Spark (PySpark, Spark SQL, Spark Streaming, Spark MLlib), Kafka, Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift/Glue, and Pig.
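Sample code (illustrative): a minimal sketch, using the kafka-python and requests libraries, of the producer pattern described in this section: poll an external REST API and publish each record to a Kafka topic. The endpoint URL, topic name, and broker address are hypothetical.

import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Pull the latest readings from the upstream REST API (hypothetical endpoint).
    resp = requests.get("https://api.example.com/iot/readings", timeout=10)
    resp.raise_for_status()
    for reading in resp.json():
        producer.send("iot-readings", value=reading)
    producer.flush()
    time.sleep(10)  # polling interval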
Client: Sonata Software Ltd, India Aug 2013 to July 2014
Role: Java Developer
Responsibilities:
● Involved in requirement analysis, design, coding, and implementation phases of the project.
● Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
● Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, Data frames
and Spark SQL APIs.
● Wrote new Spark jobs in Scala to analyze the customer and sales history data.
● Used Kafka to get data from many streaming sources into HDFS.
● Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS
for further analysis.
● Created multiple multi-tier Java-based web services to read data from MongoDB.
● Deployed Spark applications and Java web services on Pivotal Cloud Foundry.
● Worked on Hive partitioning and bucketing and performed different types of joins on Hive tables.
● Created Hive external tables to perform ETL on data generated on a daily basis.
● Written HBase bulk load jobs to load processed data to HBase tables by converting to HFiles.
● Performed validation on the data ingested to filter and cleanse the data in Hive.
● Created Sqoop jobs to handle incremental loads from RDBMS into HDFS and applied Spark transformations.
● Loaded the data into Hive tables from Spark using the ORC columnar format (a minimal sketch appears at the end of this section).
● Developed Oozie workflows to automate and productionize the data pipelines.
● Developed Sqoop import Scripts for importing reference data from Teradata.
● Worked on triggers and stored procedures on Oracle database.
● Worked on Eclipse IDE to write the code and integrate the application.
● Application was deployed on WebSphere Application Server.
● Coordinated with the testing team for the timely release of the product.
● Apache ANT was used for the entire build process.
● JUnit was used to implement test cases for beans.
Environment: Java, JSP, Servlets, JMS, JavaScript, Eclipse, WebSphere, PL/SQL, Oracle, Log4j, JUnit, ANT, Clear-case,
Windows.
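Sample code (illustrative): a PySpark sketch of the Spark-to-Hive ORC load referenced in this section (the original jobs were written in Scala); the database, table, path, and partition column are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sales-orc-load-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read the staged sales history produced by the upstream Spark transformations.
sales = spark.read.parquet("/staging/sales_history")

# Append into a partitioned Hive table stored in the ORC columnar format.
(sales.write
 .mode("append")
 .format("orc")
 .partitionBy("sale_date")
 .saveAsTable("analytics.sales_history"))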