add data ingestion code #1
Conversation
Excellent work! I left some comments here for ya, feel free to add your own comments if anything I suggested didn't make sense.
@@ -0,0 +1,149 @@
from random import choice, choices, randint, seed
Since you have several other helper functions that you've written, it might be harder to distinguish between a function imported here and one of the ones you've written. I might suggest just doing "import random" and calling "random.func" to make this a bit clearer.
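For example (illustrative calls only, not lines from this PR):

import random

# stdlib calls are now visibly namespaced, distinct from the local helpers
value = random.choice(["a", "b", "c"])
weighted = random.choices(["x", "y"], cum_weights=[0.5, 1])
random.seed(42)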
try:
    sys.argv[2]
    upload = False
except IndexError:
    print("Results will be uploaded to BigQuery")
Might be better to replace this by checking the length of the array instead. (if array length > 1 then...)
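A sketch of that check, matching the behaviour of the excerpt above (treat it as illustrative):

import sys

# results are uploaded unless an extra command-line argument was passed
upload = len(sys.argv) <= 2
if upload:
    print("Results will be uploaded to BigQuery")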
'''Manipulates the gender string'''
return choice([s, s.upper(), s.lower(),
               s[0] if len(s) > 0 else "",
               s[0].lower() if len(s) > 0 else ""])
Check the formatting here, should all be lined up.
running `black` on everything will be a fast format fix
@@ -0,0 +1,6 @@
# Submit a PySpark job via the Cloud Dataproc Jobs API
I would add a comment at the top with something along the lines of "requires having CLUSTER_NAME and BUCKET_NAME set in your environment"
+1
try:
    operation = cluster_client.delete_cluster(project, region,
                                              cluster_name)
    operation.result()
except GoogleAPICallError:
    pass
I would not catch this. If it fails it should fail loudly.
+1 - catching an exception and just passing is a no-no - even if you want to ignore an error, you should let the user know you're doing something.
As an addition - I'm not sure what you're checking for here. Is this the failure that's thrown when it can't delete it because it's already been deleted? If that's the case, add a clarifying comment and instead of the pass, a print statement that says "Cluster already deleted"
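A sketch of that suggestion, assuming the intent is to tolerate an already-deleted cluster (NotFound is the usual exception for that case; adjust if the actual failure mode is different; client and config names come from this test file):

from google.api_core.exceptions import NotFound

try:
    operation = cluster_client.delete_cluster(project, region, cluster_name)
    operation.result()
except NotFound:
    # the cluster was already removed; nothing left to clean up
    print("Cluster already deleted")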
result = job_client.submit_job(project_id=project, region=region,
                               job=job_details)

job_id = result.reference.job_id
print('Submitted job \"{}\".'.format(job_id))

# Wait for job to complete
wait_for_job(job_client, job_id)

# Get job output
cluster_info = cluster_client.get_cluster(project, region, cluster_name)
bucket = storage_client.get_bucket(cluster_info.config.config_bucket)
output_blob = (
    'google-cloud-dataproc-metainfo/{}/jobs/{}/driveroutput.000000000'
    .format(cluster_info.cluster_uuid, job_id))
out = bucket.blob(output_blob).download_as_string().decode("utf-8")
The way to do this has actually JUST been updated about a week ago to be a bit easier to work with. Apologies as I haven't had the chance to update the samples yet. You can use this method
Something like this should work (might need tweaking):
operation = job_client.submit_job_as_operation(project_id=project, region=region,
                                               job=job_details)

# This will wait for the job to finish before continuing.
result = operation.result()

output_location = result.driver_output_resource_uri + ".000000000"
output = bucket.blob(output_location).download_as_string().decode("utf-8")
def callback(operation_future):
    '''Sets a flag to stop waiting'''
    global waiting_cluster_callback
    waiting_cluster_callback = False


def wait_for_cluster_creation():
    '''Waits for cluster to create'''
    while True:
        if not waiting_cluster_callback:
            break


def wait_for_job(job_client, job_id):
    '''Waits for job to finish'''
    while True:
        job = job_client.get_job(project, region, job_id)
        assert job.status.State.Name(job.status.state) != "ERROR"

        if job.status.State.Name(job.status.state) == "DONE":
            return
Per my earlier comment, you can delete this.
pass


def test_setup(capsys):
All of your setup code should go into your `teardown` function above (perhaps renamed to something like "setup_teardown"). Based on wherever you put the `yield`, all code above it will run BEFORE executing the test, and all code after it will run AFTER. See here: dataproc_quickstart

In your code, I would create the cluster and GCS buckets in your setup/teardown function. I would leave the test function itself to only include elements of the test that involve submitting the actual job.
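A rough sketch of that shape, reusing names that already exist in this test file (cluster_client, cluster_data, cluster_name, BUCKET_NAME); illustrative only, not a drop-in:

import pytest

from google.cloud import storage


@pytest.fixture(autouse=True)
def setup_teardown():
    # Setup: everything before the yield runs BEFORE the test.
    operation = cluster_client.create_cluster(project, region, cluster_data)
    operation.result()  # block until the cluster is provisioned

    bucket = storage.Client().create_bucket(BUCKET_NAME)
    bucket.blob("setup.py").upload_from_filename("setup.py")

    yield  # the test itself runs here

    # Teardown: everything after the yield runs AFTER the test.
    bucket.delete(force=True)
    operation = cluster_client.delete_cluster(project, region, cluster_name)
    operation.result()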
ah yep - I see Brad called this out already 👍
cluster = cluster_client.create_cluster(project, region, cluster_data)
cluster.add_done_callback(callback)

# Wait for cluster to provision
global waiting_cluster_callback
waiting_cluster_callback = True
The docs are slightly misleading, as you can actually just do the following to accomplish the same thing:
operation = cluster_client.create_cluster(project, region, cluster_data)
result = operation.result() #This is blocking and will not proceed with the rest of the code until completion.
I defer to Brad on this for how to actually do it, but I'd like to see this accomplished without a global
This should be ok to be removed altogether
def dirty_data(proc_func, allow_none):
    '''Master function returns a user defined function
    that transforms the column data'''
    def udf(col_value):
        seed(hash(col_value) + time_ns())
        if col_value is None:
            return col_value
        elif allow_none:
            return random_select([None, proc_func(col_value)],
                                 cum_weights=[0.05, 1])
        else:
            return proc_func(col_value)
    return udf
Does this require having a nested function?
We have a nested function here because a User Defined Function can only take 1 argument (the column value). If we do not have this nested function, we will have repetitive code of calling the UDF for each column or have an unwieldy list comprehension. Which option do you think is best?
Awesome first pass - it's great to have you coding and to see what you've come up with. Also, pytest fixtures are NOT easy so def give yourselves a pat on the back for getting fixtures that run at all! :)

I think that setup.py can probably be reorganized in a way that makes the test much simpler to run. I'd like to see your code in setup.py laid out as follows (see the sketch after this list):
- global variables at the top
- helper functions, which should probably be private functions (preceded by an underscore in their name)
- some kind of main function, though it doesn't have to be called main, that goes through and executes everything needed to dirty the data - the BQ checks you have at the beginning, the actual data dirtying, and the saving of the results

Then, when you are testing, you will have:
- a fixture to create/teardown the dataproc cluster
- a fixture to create/teardown the gcs bucket (possibly including uploading what you need to upload)
- a test that executes your "main" function (or whatever it's called) and calls assertions on the output

Nit - setup-test.py should be renamed setup_test.py to match the conventions of the rest of the upstream repo.
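A skeleton of that layout might look like this (every name and body here is a placeholder):

# setup.py

# 1. Global variables at the top
BUCKET_NAME = None  # resolved from command-line arguments in main()


# 2. Private helper functions
def _dirty_data(proc_func, allow_none):
    '''Placeholder for one of the existing helpers.'''
    ...


# 3. One entry point that runs the BQ checks, dirties the data, and saves the results
def main():
    ...


if __name__ == "__main__":
    main()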
project = os.environ['GCLOUD_PROJECT']
region = "us-central1"
zone = "us-central1-a"
cluster_name = 'setup-test-{}'.format(str(uuid.uuid4()))
Nit - for the next two lines, f-strings are now preferred Python practice over .format.
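For example, the cluster name line above becomes:

cluster_name = f'setup-test-{uuid.uuid4()}'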
BUT YAY FOR UUIDS
# Set global variables
project = os.environ['GCLOUD_PROJECT']
region = "us-central1"
Should we consider making the region + zone environment variables as well? Or include a TODO for users to update these to reflect their project? Not all folks are US based.
@bradmiro I know some products just straight up aren't available in other regions/zones, is Dataproc one?
I've personally always hard-coded regions into tests as these don't typically make their way into tutorials. For the cluster to be used in the tutorial, I think a set of steps showing how to create a Dataproc cluster in a preferred region is appropriate.
I believe Dataproc is available in any region that Compute instances are available.
waiting_cluster_callback = False

# Set global variables
Python style thing - global variables should be all caps (project -> PROJECT, region -> REGION, zone -> ZONE)
@pytest.fixture(autouse=True)
def teardown():
This fixture only does teardown, but I think it should be renamed and actually should be a setup and a teardown fixture, with cluster creation happening above the yield, and the ID or name of the created cluster (or maybe the operation?) being passed to the test
               s[0].lower() if len(s) > 0 else ""])


def convertAngle(angle):
nit - snake case convertAngle -> convert_angle
new_angle = str(degrees) + u"\u00B0" + \
    str(minutes) + "'" + str(seconds) + '"'
return random_select([str(angle), new_angle], cum_weights=[0.55, 1])
nit - spell out cumulative
return choice([name, name.replace("&", "/")])


def usertype(user):
nit - snake case usertype -> user_type
Thank you for these comments! The setup.py file is never run locally (only on the dataproc clusters) so we cannot call the main function specifically because we submit the entire file as a job. But, we can still restructure the setup.py file into multiple functions for easier readability.
I focused more on the testing and am leaving some of the Spark stuff up to Brad for now
.gitignore (Outdated)
@@ -27,3 +27,4 @@ credentials.dat
.DS_store
No committing the gitignore plz
return udf


def id(x):
nit - if this is bike_id, can we call it bike_id? id is very generic otherwise
I would also add that there should be a brief comment of why a function that just returns its input value is necessary.
gcloud dataproc jobs submit pyspark \
    --cluster ${CLUSTER_NAME} \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
@bradmiro how often would this jar change? I'm worried about this shell script going stale when the jar is updated
Not often at all. I haven't found a reliable way to dynamically call this jar, but transitioning to new Scala versions is not frequent.
storage_client = storage.Client()
BUCKET = storage_client.create_bucket(BUCKET_NAME)

yield
Instead of using a global variable `BUCKET`, you can yield the bucket itself, then pass the fixture to the test as an argument. So, in line 87 you'd have

bucket = storage_client.create_bucket(BUCKET_NAME)
yield bucket

then in the test itself, you'd call

def test_setup(capsys, setup_and_teardown_bucket)

and replace every other instance of `BUCKET` with `setup_and_teardown_bucket`.

That said, it might be worth renaming the fixture to `test_bucket` or something similar that you like, since that's what you're yielding.
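A sketch of that pattern, assuming the fixture is renamed test_bucket as suggested (the names are illustrative):

import pytest
from google.cloud import storage


@pytest.fixture
def test_bucket():
    '''Creates a scratch bucket for the test and removes it afterwards.'''
    storage_client = storage.Client()
    bucket = storage_client.create_bucket(BUCKET_NAME)

    yield bucket  # the test receives the bucket object directly

    bucket.delete(force=True)


def test_setup(capsys, test_bucket):
    # use test_bucket wherever the module previously referenced BUCKET
    blob = test_bucket.blob("setup.py")
    blob.upload_from_filename("setup.py")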
BUCKET = None


@pytest.fixture(autouse=True)
From what I can tell, this fixture works without yielding anything, but you could also get rid of the autouse and instead yield the cluster name, which you'd then call as an argument in the definition of the test function (See below comment for the GCS example)
Afaik, it's the same result, different method. Not experienced enough with pytest to know what best practice in this case is.
It looks like we need to yield to separate the setup code from the teardown code for the cluster. And, we do not need to yield the cluster name because it is a global variable that uses a uuid.
assert "null" in out | ||
|
||
|
||
def get_blob_from_path(path): |
Is this a helper function? If so, it should be at the top
    df = spark.read.format('bigquery').option('table', TABLE).load()
except Py4JJavaError:
    print(f"{TABLE} does not exist. ")
    sys.exit(0)
You can replace `sys.exit(0)` with `return` to achieve the same effect, albeit potentially more gracefully, now that you've wrapped this into a function.
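In context that looks roughly like this (names taken from the excerpt above; the fragment sits inside a function):

# Check if table exists; leave the function early if it does not
try:
    df = spark.read.format('bigquery').option('table', TABLE).load()
except Py4JJavaError:
    print(f"{TABLE} does not exist. ")
    return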
new_df.sample(False, 0.0001, seed=50).show(n=100)

# Duplicate about 0.01% of the rows
dup_df = new_df.sample(True, 0.0001, seed=42)
What's the intention behind hard-coding the seeds?
There's no intention, we just thought it might be better for this to be deterministic. We'll remove them in the next revision since there is no reason to hard-code them.
'cluster_name': CLUSTER_NAME,
'config': {
    'gce_cluster_config': {
        'zone_uri': zone_uri,
You can use `'zone_uri': ''` to have Dataproc automatically select a zone. This is generally preferred unless you have a reason for needing to use a particular zone.
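Applied to the config above, that would look something like this (other config fields omitted):

cluster_data = {
    'cluster_name': CLUSTER_NAME,
    'config': {
        'gce_cluster_config': {
            # an empty string lets Dataproc pick a zone within the region
            'zone_uri': '',
        },
        # ... rest of the existing config unchanged ...
    },
}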
cluster_client = dataproc.ClusterControllerClient(client_options={
    'api_endpoint': f'{REGION}-dataproc.googleapis.com:443'
})
This will persist from above, you don't need it twice.
# Create cluster using cluster client
cluster_client = dataproc.ClusterControllerClient(client_options={
    'api_endpoint': '{}-dataproc.googleapis.com:443'.format(REGION)
nit: use f string
def get_blob_from_path(path):
    bucket_name = re.search("dataproc.+?/", path).group(0)[0:-1]
    bucket = storage.Client().get_bucket(bucket_name)
    output_location = re.search("google-cloud-dataproc.+", path).group(0)
    return bucket.blob(output_location)
I would either inline this in the section of the code where you reference it, or move it up to the top of the test file.
'''Tests setup.py by submitting it to a dataproc cluster'''

# Upload file
destination_blob_name = "setup.py"
Move to top-level as a "final" capitalized field
blob = BUCKET.blob(destination_blob_name)
blob.upload_from_filename("setup.py")
I would do this in your setup steps
job_file_name = "gs://" + BUCKET_NAME + "/setup.py"

# Create job configuration
job_details = {
    'placement': {
        'cluster_name': CLUSTER_NAME
    },
    'pyspark_job': {
        'main_python_file_uri': job_file_name,
        'args': [
            BUCKET_NAME,
            "--test",
        ],
        "jar_file_uris": [
            "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
        ],
    },
}
I would move this to the top level as a "final" variable, or make a function "get_job" that just returns this
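For instance, a hypothetical get_job helper (name taken from the comment above) could return the config so the test body stays short:

def get_job(bucket_name, cluster_name):
    '''Builds the Dataproc job config for submitting setup.py.'''
    return {
        'placement': {'cluster_name': cluster_name},
        'pyspark_job': {
            'main_python_file_uri': f"gs://{bucket_name}/setup.py",
            'args': [bucket_name, "--test"],
            'jar_file_uris': [
                "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
            ],
        },
    }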
}

# Submit job to dataproc cluster
job_client = dataproc.JobControllerClient(client_options={
When I write tests, I personally prefer that the tests have as little code as possible in them. I think if you start the code here and move everything either into "final" variables or in your setup function, it ends up being a bit more succinct.
TBH I don't think this test has much code in it - most of it is asserts, which we could cut down on if you like.
I think the test is now fine as is in relation to this specific thread. My original point was more towards moving configs and such out of the file.
degrees = int(angle)
minutes = int((angle - degrees) * 60)
seconds = int((angle - degrees - minutes/60) * 3600)
new_angle = str(degrees) + u"\u00B0" + \
In Python 3, all non-byte strings are unicode. You can safely change `u"\u00B0"` to `"\u00B0"`.
# Declare data transformations for each column in dataframe
udfs = [
    (dirty_data(trip_duration, True), StringType()),  # tripduration
    (dirty_data(identity, True), StringType()),  # starttime
Unless the `identity` function is used elsewhere within the job, you can replace it by using a lambda function `_ = lambda x: x` that you declare just before you create `udfs`. You can then replace every instance of `identity` with `_`.
Unfortunately, when I run `nox -s lint`, the linter complains about lambda functions and says to use def instead. What do you think I should do?
(echoing my comment from standup) You should be able to inline the lambdas instead: `(dirty_data(lambda x: x, True), StringType()),`
I know we talked about this at standup but I forgot to mention for future reference, if Brad or I don't have a good answer, this is a great thing to ask at Python Samples Office Hours! Or to ask me to ask the other samples owners if samples office hours are awhile away.
import random
import sys

from time import time_ns

from google.cloud import bigquery

from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType, StringType
nit: linting imports - https://www.python.org/dev/peps/pep-0008/#imports
spark.conf.set('temporaryGcsBucket', BUCKET_NAME)

df.write.format('bigquery') \
    .option('table', dataset_id + ".RAW_DATA") \
nit: you can move the temporary bucket into the `write` operation to avoid editing the Spark conf:

df.write.format("bigquery") \
    .option("table", dataset_id + ".RAW_DATA") \
    .option("temporaryGcsBucket", BUCKET_NAME) \
    .save()
bucket.delete(force=True)


def get_blob_from_path(path):
Why is this function needed to get the bucket_name when you can yield it from the fixture?
This function is actually getting a different bucket (the Dataproc job output). Also, it is here to convert the URL into a blob so that it can be downloaded as a string. The bucket created from the fixture is used to upload our script.
yield

# Delete cluster
operation = cluster_client.delete_cluster(PROJECT, REGION,
What happens if the cluster isn't found? You may need to add a try/except here. @bradmiro plz chime in if there's a best practice
I typically just let this fail loudly. In any instance where the cluster isn't properly deleted, it is usually indicative of other problems.
assert re.search("[0-9] h", out)

# station latitude & longitude
assert re.search(u"\u00B0" + "[0-9]+\'[0-9]+\"", out)
nit: remove the unicode u
@@ -46,12 +43,6 @@
'num_instances': 6,
Does the runtime change drastically if you change this to 4 or 8?
def print_df(df, table_name):
    '''Print 20 rows from dataframe and a random sample'''
    # first 100 rows for smaller tables
    df.show()

    # random sample for larger tables
    # for small tables this will be empty
    df.sample(True, 0.0001).show(n=500, truncate=False)

    print(f"Table {table_name} printed")
Do you think printing the external datasets as well as the dirty one will affect some assert statements in our test script?
As long as what you're asserting is in `out` somewhere, I don't think it should matter.
@@ -91,12 +127,25 @@ def main():
upload = True  # Whether to upload data to BigQuery

# Check whether or not results should be uploaded
if len(sys.argv) > 2:
    if '--test' in sys.argv:
I think we changed this to --dry-run?
+1
RAW_TABLE_NAME = "RAW_DATA"
DATASET_NAME = "data_science_onramp"
RAW_TABLE_NAME = "new_york_citibike_trips"
EXTERNAL_DATASETS = {
Instead of having an external datasets dictionary, can we just call it datasets and group in the citibike one too? Asking because I see that there is some repetitive code below (e.g. `if upload: write_to_bigquery ...`). We might need to make some tweaks if we do this because it does not have a URL, so not sure if it's worth it. I'm curious to see what everyone thinks!
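A rough sketch of that grouping, where a None entry marks a table that already lives in BigQuery; the loader helper and the sample URL are hypothetical:

def _load_dataframe(table_name, url):
    '''Hypothetical helper: fetch from the URL, or read the existing BigQuery table when url is None.'''
    ...


DATASETS = {
    # table name -> external source URL (None means the table is already in BigQuery)
    "new_york_citibike_trips": None,
    "citibike_stations": "https://example.com/stations.csv",  # hypothetical entry
}

for table_name, url in DATASETS.items():
    df = _load_dataframe(table_name, url)
    if upload:
        write_to_bigquery(df, table_name)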
@@ -118,6 +125,10 @@ def test_setup():
blob = get_blob_from_path(output_location)
out = blob.download_as_string().decode("utf-8")

# check that tables were printed
for table_name in TABLE_NAMES:
    assert table_printed(table_name, out)
Does this need to be its own function if it's only being used once? Can we not just do `assert re.search(f"Table {table_name} printed", out)`?
@@ -118,6 +125,10 @@ def test_setup():
blob = get_blob_from_path(output_location)
out = blob.download_as_string().decode("utf-8")

# check that tables were printed
for table_name in TABLE_NAMES:
    assert table_printed(table_name, out)
Is there another way to test that the table actually printed instead of checking that the print statement you added is present? There could be a possibility where the table actually did not print but your print statement did which is not a sufficient check.
If the df.show prints column names, you could check for those
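For example, something along these lines (the column names are illustrative guesses at what df.show() would print):

# df.show() prints a header row with the column names, so finding them in the
# captured driver output is stronger evidence than checking our own print statement
for column_name in ["tripduration", "starttime", "gender"]:
    assert column_name in out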
# Check if table exists
try:
    df = spark.read.format('bigquery').option('table', TABLE).load()
except Py4JJavaError:
Add a short comment explaining why we'd return if this happens (if I understand correctly, this error will happen when you do the `--dry-run` option).
# Create final dirty dataframe
df = df.union(dup_df)

if upload:
What's the difference between this upload-and-write-to-BigQuery block and the other one? You may want to consider making this a function because it's repeated code.
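A sketch of that refactor, assuming a shared helper; the helper name and the module-level names are taken from elsewhere in this thread, so treat it as illustrative:

def write_to_bigquery(df, table_name):
    '''Writes a dataframe to BigQuery; shared by both upload paths.'''
    df.write.format('bigquery') \
        .option('table', dataset_id + "." + table_name) \
        .option('temporaryGcsBucket', BUCKET_NAME) \
        .save()


# ...then at each call site:
if upload:
    write_to_bigquery(df, RAW_TABLE_NAME)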
This needs a rebase (can help with this). Can we rebase and then merge?
The dirty data script and corresponding test file are in the data-ingestion folder. We are looking forward to your feedback :)