Mastering Sqoop for Data Transfer for Big Data
Jarek Jarcec Cecho | Kathleen Ting

Who Are We?
• Jarek Jarcec Cecho
  • Apache Sqoop Committer, PMC Member
  • Software Engineer, Cloudera
  • jarcec@apache.org
• Kathleen Ting
  • Apache Sqoop Committer, PMC Member
  • Customer Operations Engineering Manager, Cloudera
  • kathleen@apache.org, @kate_ting

What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
  • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
  • HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa

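As a rough illustration of both transfer directions, the sketch below assembles Sqoop 1 command lines from Python. The flags (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--num-mappers`, `--username`, `-P`, `--as-avrodatafile`) are standard Sqoop 1 options; the helper names, host, database, table, and paths are hypothetical and only serve the example.

```python
from typing import List, Optional


def build_import_command(
    jdbc_url: str,
    table: str,
    target_dir: str,
    username: Optional[str] = None,
    num_mappers: int = 4,
    as_avro: bool = False,
) -> List[str]:
    """Assemble the argument list for a Sqoop 1 import (database -> HDFS)."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]
    if username is not None:
        # -P prompts for the password instead of exposing it on the command line.
        cmd += ["--username", username, "-P"]
    if as_avro:
        # Store the imported rows as Avro data files instead of plain text.
        cmd.append("--as-avrodatafile")
    return cmd


def build_export_command(jdbc_url: str, table: str, export_dir: str) -> List[str]:
    """Assemble the argument list for a Sqoop 1 export (HDFS -> database)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # Hypothetical connection string, table, and HDFS path.
    url = "jdbc:mysql://db.example.com/shop"
    print(" ".join(build_import_command(url, "orders", "/data/orders", username="etl")))
    print(" ".join(build_export_command(url, "orders_summary", "/data/orders_summary")))
```

The resulting list can be passed to `subprocess.run` on a machine where the `sqoop` client is installed.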
Why Sqoop?
• Efficient/Controlled resource utilization
  • Concurrent connections, Time of operation
• Datatype mapping and conversion
  • Automatic, and User override
• Metadata propagation
  • Sqoop Record
  • Hive Metastore
  • Avro

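The "automatic, with user override" idea behind datatype mapping can be sketched as follows. The default table here is a small illustrative subset chosen for the example, not Sqoop's actual mapping table, and both function names are hypothetical. The only real Sqoop piece is the `--map-column-java` option, which takes comma-separated `column=JavaType` pairs.

```python
from typing import Dict

# Illustrative subset of SQL-to-Java defaults (an assumption for this sketch).
DEFAULT_SQL_TO_JAVA: Dict[str, str] = {
    "INTEGER": "Integer",
    "BIGINT": "Long",
    "VARCHAR": "String",
    "DECIMAL": "java.math.BigDecimal",
}


def resolve_java_types(columns: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Pick a Java type per column: the user override wins, else the default."""
    resolved = {}
    for name, sql_type in columns.items():
        if name in overrides:
            resolved[name] = overrides[name]
        else:
            resolved[name] = DEFAULT_SQL_TO_JAVA[sql_type.upper()]
    return resolved


def map_column_java_arg(overrides: Dict[str, str]) -> str:
    """Format overrides as the value passed to --map-column-java."""
    return ",".join(f"{column}={java_type}" for column, java_type in overrides.items())


if __name__ == "__main__":
    # Hypothetical table layout and a single user override.
    columns = {"id": "INTEGER", "name": "VARCHAR", "total": "DECIMAL"}
    overrides = {"id": "String"}
    print(resolve_java_types(columns, overrides))
    print("--map-column-java", map_column_java_arg(overrides))
```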
Sqoop 1

Sqoop 1
• Based on Connectors
  • Responsible for Metadata lookups, and Data Transfer
  • Majority of connectors are JDBC based
  • Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
  • HBase Import, Avro Support, ...

Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced

Sqoop 1 Challenges
• Non-uniform functionality
  • Different connectors support different capabilities
• Overlapped/Duplicated functionality
  • Different connectors may implement same capabilities differently
• High coupling with Hadoop
  • Database vendors required to understand Hadoop idiosyncrasies in order to build connectors.

Sqoop 2

Sqoop 2 – Design Goals
• Security and Separation of Concerns
  • Role based access and use
• Ease of extension
  • No low-level Hadoop knowledge needed
  • No functional overlap between Connectors
• Ease of Use
  • Uniform functionality
  • Domain specific interactions

Sqoop 2: Connection vs Job metadata
There are two distinct sets of options to pass into Sqoop:
• Connection (distinct per database)
• Job (distinct per table)

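One way to picture the split between the two option sets is a pair of record types, with a job pointing at the connection it reuses. This is a minimal sketch, not Sqoop 2's actual object model: the class names, field names, and sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Connection:
    """Options that are distinct per database (hypothetical fields)."""
    name: str
    jdbc_url: str
    username: str


@dataclass(frozen=True)
class Job:
    """Options that are distinct per table (hypothetical fields)."""
    name: str
    connection: Connection
    table: str
    target_dir: str

    def effective_options(self) -> Dict[str, str]:
        """Combine the shared connection options with the per-table options."""
        return {
            "jdbc_url": self.connection.jdbc_url,
            "username": self.connection.username,
            "table": self.table,
            "target_dir": self.target_dir,
        }


if __name__ == "__main__":
    # One connection, defined once, reused by two table-level jobs.
    shop = Connection("shop", "jdbc:mysql://db.example.com/shop", "etl")
    orders = Job("import-orders", shop, "orders", "/data/orders")
    customers = Job("import-customers", shop, "customers", "/data/customers")
    print(orders.effective_options())
    print(customers.effective_options())
```

Keeping database details in one place means a change such as a new host touches a single connection rather than every job.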
Sqoop 2: Workings
• Connectors register metadata
• Metadata enables creation of Connections and Jobs
• Connections and Jobs stored in Metadata Repository
• Operator runs Jobs that use appropriate connections
• Admins set policy for connection use

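The flow above (register, create, store, run) can be mimicked with a small in-memory stand-in. Everything here is hypothetical: `MetadataRepository`, its methods, the connector name, and the field names are invented for the sketch and do not mirror Sqoop 2's real API. Admin policy is left out to keep the example short.

```python
from typing import Dict, Set


class MetadataRepository:
    """In-memory stand-in for a metadata repository (hypothetical API)."""

    def __init__(self) -> None:
        self.connector_fields: Dict[str, Dict[str, Set[str]]] = {}
        self.connections: Dict[str, dict] = {}
        self.jobs: Dict[str, dict] = {}

    def register_connector(self, connector: str, connection_fields: Set[str], job_fields: Set[str]) -> None:
        """A connector declares which fields a connection and a job must supply."""
        self.connector_fields[connector] = {
            "connection": set(connection_fields),
            "job": set(job_fields),
        }

    def _validate(self, connector: str, kind: str, values: Dict[str, str]) -> None:
        missing = self.connector_fields[connector][kind] - set(values)
        if missing:
            raise ValueError(f"missing {kind} fields: {sorted(missing)}")

    def create_connection(self, name: str, connector: str, values: Dict[str, str]) -> None:
        """Validate against the registered metadata, then store the connection."""
        self._validate(connector, "connection", values)
        self.connections[name] = {"connector": connector, "values": dict(values)}

    def create_job(self, name: str, connection: str, values: Dict[str, str]) -> None:
        """Validate against the registered metadata, then store the job."""
        connector = self.connections[connection]["connector"]
        self._validate(connector, "job", values)
        self.jobs[name] = {"connection": connection, "values": dict(values)}

    def run_job(self, name: str) -> Dict[str, str]:
        """Resolve a stored job and its connection into one set of options."""
        job = self.jobs[name]
        connection = self.connections[job["connection"]]
        return {**connection["values"], **job["values"]}


if __name__ == "__main__":
    repo = MetadataRepository()
    repo.register_connector("generic-jdbc", {"jdbc_url", "username"}, {"table"})
    repo.create_connection("shop", "generic-jdbc",
                           {"jdbc_url": "jdbc:mysql://db.example.com/shop", "username": "etl"})
    repo.create_job("orders", "shop", {"table": "orders"})
    print(repo.run_job("orders"))
```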
Sqoop 2: Security
• Support for secure access to external systems via role-based access to connection objects
  • Administrators create/edit/delete connections
  • Operators use connections

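The role split on this slide can be expressed as a small permission table checked before every action on a connection object. This is only a sketch under assumptions: the `Role` enum, `ConnectionStore` class, and its methods are hypothetical, and the idea that an operator gets a usable reference without seeing the stored password is one plausible reading of "secure access", not a statement about Sqoop 2's implementation.

```python
from enum import Enum
from typing import Dict


class Role(Enum):
    ADMIN = "admin"
    OPERATOR = "operator"


# Permissions exactly as listed on the slide.
PERMISSIONS = {
    Role.ADMIN: {"create", "edit", "delete"},
    Role.OPERATOR: {"use"},
}


class ConnectionStore:
    """Holds connection objects and enforces role-based access (hypothetical)."""

    def __init__(self) -> None:
        self._connections: Dict[str, Dict[str, str]] = {}

    @staticmethod
    def _check(role: Role, action: str) -> None:
        if action not in PERMISSIONS[role]:
            raise PermissionError(f"{role.value} may not {action} connections")

    def create(self, role: Role, name: str, jdbc_url: str, password: str) -> None:
        self._check(role, "create")
        self._connections[name] = {"jdbc_url": jdbc_url, "password": password}

    def delete(self, role: Role, name: str) -> None:
        self._check(role, "delete")
        del self._connections[name]

    def use(self, role: Role, name: str) -> str:
        """Return a reference to the connection without exposing the password."""
        self._check(role, "use")
        return f"connection:{name} -> {self._connections[name]['jdbc_url']}"


if __name__ == "__main__":
    store = ConnectionStore()
    store.create(Role.ADMIN, "shop", "jdbc:mysql://db.example.com/shop", "secret")
    print(store.use(Role.OPERATOR, "shop"))
    try:
        store.delete(Role.OPERATOR, "shop")
    except PermissionError as err:
        print(err)
```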
Current Status: Sqoop 2
• Primary focus of the Sqoop Community
• First cut: 1.99.1
  • bits and docs: http://sqoop.apache.org/

Demo