dynamodb upload script and splitter script · premaseem/pythonLab@fc61986 · GitHub

Commit fc61986
Aseem Jain authored and committed
dynamodb upload script and splitter script
1 parent 8615d9a commit fc61986

11 files changed: +1600 -0 lines changed
Load Troy Hunt blacklisted password data set in DynamoDB
========================================================

Steps:
------

* Download the Troy Hunt data set, which is around 9 GB.
* Extract the downloaded file (around 22 GB uncompressed).
* Split the single file into smaller files (e.g. 500 MB each) to manage it better.
* Prune or trim the data to keep only passwords that were compromised more than a configurable threshold number of times.
* Load the data in batches into DynamoDB using the Python script.
* The script only loads data; it assumes the expected table already exists before you run it.

Download Troy Hunt data set
===========================

https://haveibeenpwned.com/Passwords
SHA-1 Version 3 (ordered by hash) 13 Jul 2018 9.18GB 10c001292d52a04dc0fb58a7fb7dd0b6ea7f7212

Extract data
============

Extract / unzip the file; it will produce a single 22 GB file.
Each line of the file will look like:
20EABE5D64B0E216796E834F52D61FD0B70332FC:2298084

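Each record is a SHA-1 hash and a breach count separated by a colon. A minimal Python sketch of parsing one line (the function name is illustrative, not part of the repo):

```python
def parse_line(line):
    # Each record is "<SHA-1 hash>:<breach count>"
    pwd_hash, _, count = line.strip().partition(":")
    return pwd_hash, int(count)

print(parse_line("20EABE5D64B0E216796E834F52D61FD0B70332FC:2298084"))
# → ('20EABE5D64B0E216796E834F52D61FD0B70332FC', 2298084)
```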
Split single file
=================

Split the single file into smaller files to make it easier to manage:
$ split -b <size in bytes> sourceFilePath splitFilesPrefix
eg. split -b 26214400 pwned-passwords-ordered-by-hash.txt ./splitedFiles/blacklist
(Note: 26214400 bytes is 25 MB; for 500 MB chunks use 524288000.)

Prune or trim data
==================

# configure the threshold value in a shell variable
THRESHOLD_LIMIT=5000

# use gawk to prune or trim each file (install the gawk package if required)

$ gawk 'BEGIN {FS=":"} {if($2>'$THRESHOLD_LIMIT') {print $1":"$2 } }' sourceFile > sourceFile-pruned.txt
eg. gawk 'BEGIN {FS=":"} {if($2>'$THRESHOLD_LIMIT') {print $1":"$2 } }' blacklistFile > blackListFile-pruned.txt

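For reference, the same pruning filter can be sketched in Python (names are illustrative; the repo uses gawk for this step):

```python
THRESHOLD_LIMIT = 5000  # minimum breach count to keep (mirrors the gawk variable)

def prune(lines, threshold=THRESHOLD_LIMIT):
    # Keep only "hash:count" records whose count exceeds the threshold.
    for line in lines:
        pwd_hash, _, count = line.strip().partition(":")
        if count.isdigit() and int(count) > threshold:
            yield f"{pwd_hash}:{count}"

records = [
    "20EABE5D64B0E216796E834F52D61FD0B70332FC:2298084",
    "AAAABBBBCCCCDDDDEEEEFFFF0000111122223333:12",
]
print(list(prune(records)))
# → ['20EABE5D64B0E216796E834F52D61FD0B70332FC:2298084']
```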
Move pruned file
================

Validate that all files have been pruned (filtered) based on the threshold limit.
Once all files are pruned, move them to a folder so they can be used as input for the data loader script:
$ mkdir -p ./prunedInputFile
eg. mv *-pruned.txt ./prunedInputFile

Configure AWS DynamoDB details
==============================

Please provide connection details in `config.properties` and logging levels in `logging.properties`.
These files are located in the resources folder. The default relative paths are:
./resources/config.properties
./resources/logging.properties

A custom location for the above configuration can be passed using an optional flag or argument to the script:
-c for the path of config.properties
-l for the path of logging.properties

Note: It is assumed that AWS credentials are already set in the ~/.aws folder with `aws_secret_access_key` and `aws_access_key_id`.

Sample config.properties
========================

[dynamodb]
endpoint_url = http://localhost:8000
region = us-west-2
table_name = password_blacklist

Sample logging.properties
=========================

[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
handlers=console
level=INFO

[handler_console]
class=StreamHandler
formatter=simple
args=(sys.stdout,)

[formatter_simple]
format=%(asctime)s %(levelname)s %(message)s
class=logging.Formatter

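A minimal sketch of how such a properties file is read with Python's configparser (the temporary file is used only for illustration; the script reads the paths above):

```python
import configparser
import os
import tempfile

sample = """[dynamodb]
endpoint_url = http://localhost:8000
region = us-west-2
table_name = password_blacklist
"""

# Write the sample to a temp file, then read it back the way the loader does.
with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write(sample)
    path = f.name

config = configparser.ConfigParser()
config.read([path])
print(config["dynamodb"]["table_name"])  # → password_blacklist
os.unlink(path)
```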
Create table in DynamoDB
========================

The table can be created using the web console (better control).

The command below can be used to create the table using the AWS CLI:
TABLE_NAME=pwd_blacklist
KEY_NAME=pwd_hash

$ aws dynamodb create-table --table-name $TABLE_NAME --attribute-definitions AttributeName=$KEY_NAME,AttributeType=S --key-schema AttributeName=$KEY_NAME,KeyType=HASH --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1 --endpoint-url http://localhost:8000

# Validate that the table has been created before inserting data
$ aws dynamodb list-tables --endpoint-url http://localhost:8000

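The same table definition can be expressed for boto3's create_table call; a sketch (the helper name is illustrative, and the boto3 call is shown commented since it needs a live endpoint):

```python
def create_table_params(table_name, key_name):
    # Build the same table definition as the CLI command above.
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": key_name, "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": key_name, "KeyType": "HASH"}],
        "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    }

# With boto3 this would be invoked as (requires a reachable endpoint):
#   import boto3
#   client = boto3.client("dynamodb", endpoint_url="http://localhost:8000")
#   client.create_table(**create_table_params("pwd_blacklist", "pwd_hash"))

print(create_table_params("pwd_blacklist", "pwd_hash")["KeySchema"])
# → [{'AttributeName': 'pwd_hash', 'KeyType': 'HASH'}]
```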
Inserting data in DynamoDB
==========================

Please make sure you have a Python 3 environment and install the required pip modules:
$ pip install -r requirements.txt

$ python data_loader.py -f <path of data file with Troy Hunt password data set>
eg. python data_loader.py -f /Users/asee2278/dataloding/demo/splitedFiles/prunedInputFile/blacklistaa-pruned.txt

Optionally, the paths of the config and logging files can be passed as arguments as well:
eg. python data_loader.py -f ~/input.txt -c ~/myconfig.properties -l ~/mylogging.properties

Consolidated report
===================

At the end of the script you will see a consolidated report like the one below:

/Users/asee2278/virtualEnvironments/p2/bin/python /Users/asee2278/idmCode/aseemFork/cloud-identity-client-scripts/dynamodb_scripts/data_loader.py -f input.txt
2018-10-09 20:48:09,975 INFO ***** Consolidated report of data insertion for the input file input.txt
2018-10-09 20:48:09,975 INFO Number of records 35000
2018-10-09 20:48:09,975 INFO Number of records inserted 35000
2018-10-09 20:48:09,975 INFO Number of records failed to insert 0


# To list tables against a real AWS regional endpoint:
aws dynamodb list-tables --endpoint-url https://dynamodb.us-east-2.amazonaws.com
data_loader.py
==============

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import argparse
import configparser
import datetime
import logging.config
import os

import boto3
from botocore.exceptions import ClientError, ParamValidationError

# Running totals for the consolidated report printed at the end.
total_records_in_file = 0
total_inserted_records = 0
total_failed_records = 0


def increment_error_count():
    global total_failed_records
    total_failed_records += 1


def load_data(input_file, dynamo_table):
    try:
        with open(input_file) as f:
            for line in f:
                global total_records_in_file
                total_records_in_file += 1

                # get prepared item
                item = prepare_item(line)

                if item:
                    insert_data_in_dynamodb(item, dynamo_table)
                else:
                    logging.error("Skipped adding line due to error " + str(line))
                    increment_error_count()
    except Exception as e:
        print(e)


def insert_data_in_dynamodb(item, table):
    try:
        logging.debug("inserting " + str(item))
        table.put_item(Item=item)
        global total_inserted_records
        total_inserted_records += 1
        # print a progress dot every 100 inserts
        if total_inserted_records % 100 == 0:
            print(".", end="")

    except ParamValidationError as e:
        increment_error_count()
        logging.error("Parameter validation error: %s" % e)

    except ClientError as e:
        increment_error_count()
        if e.response['Error']['Code'] == 'EntityAlreadyExists':
            logging.error("Password already exists in Database")
        else:
            logging.error(e.response['ResponseMetadata']['RequestId'])
            logging.error(e.response['Error']['Message'])


def prepare_item(line):
    item = None
    try:
        attr_values = line.strip().split(':')
        pass_hash = attr_values[0]
        count = attr_values[1]
        item = {
            'pwd_hash': pass_hash,
            'count': parse_int(count)
        }
    except Exception as e:
        logging.error("Error occurred while parsing " + str(attr_values))
        logging.error(e)
    return item


def parse_int(value):
    try:
        return int(value)
    except ValueError:
        # in case the count cannot be parsed, 0 is returned
        return 0


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="DynamoDB data loader")
    parser.add_argument('-f', dest='input_file_path', default="input-10.txt",
                        help='please provide path of input file to load data')
    parser.add_argument('-c', dest='config_file_path',
                        default='resources/config.properties',
                        help='please provide path of config.properties file')
    parser.add_argument('-l', dest='log_config_file_path',
                        default='resources/logging.properties',
                        help='please provide path of logging.properties file')

    config = configparser.ConfigParser()
    inputs = parser.parse_args()

    if inputs.input_file_path is None:
        print("Please use -f to pass input file path")
        quit()

    if not os.path.isfile(inputs.config_file_path) or not os.path.isfile(
            inputs.log_config_file_path):
        print("Please provide valid file paths for arg -c (config.properties) "
              "and arg -l (logging.properties)")
        quit()

    try:
        config.read([inputs.config_file_path])
        logging.config.fileConfig(inputs.log_config_file_path)
    except Exception as ex:
        print("invalid path for configuration files")
        logging.error(ex)
        quit()

    endpoint = config['dynamodb']['endpoint_url']
    region = config['dynamodb']['region']
    table_name = config['dynamodb']['table_name']

    dynamodb = boto3.resource('dynamodb', region_name=region,
                              endpoint_url=endpoint)
    table = dynamodb.Table(table_name)
    start_time = datetime.datetime.now()
    load_data(inputs.input_file_path, table)

    end_time = datetime.datetime.now()
    time_to_upload = end_time - start_time
    logging.info(
        "***** Consolidated report of data insertion for the input file {}".format(
            inputs.input_file_path))

    logging.info("Number of records {}".format(total_records_in_file))
    logging.info("Number of records inserted {}".format(total_inserted_records))
    logging.info("Number of records failed to insert {}".format(total_failed_records))
    logging.info("Time Started: {}".format(start_time))
    logging.info("Time Ended {}".format(end_time))
    logging.info("Time taken {}".format(time_to_upload))
    # mark the input file as processed
    os.rename(inputs.input_file_path, inputs.input_file_path + "-done")

    # logging.info("Upload rate per second is {}".format(
    #     total_inserted_records // time_to_upload.total_seconds()))
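The script above writes one item per put_item call. As a possible optimization (not part of the script), boto3's batch_writer batches writes; DynamoDB's BatchWriteItem accepts at most 25 items per request. A sketch with an illustrative chunking helper (the boto3 part is commented since it needs a live table):

```python
def chunks(items, size=25):
    # DynamoDB BatchWriteItem accepts at most 25 items per request.
    for i in range(0, len(items), size):
        yield items[i:i + size]

# With boto3, batching would look like (not run here):
#   with table.batch_writer() as batch:
#       for item in items:
#           batch.put_item(Item=item)

print(len(list(chunks(list(range(60))))))  # → 3
```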
