MaxMind seems to be mapping the same IP to different countries
Open, Needs Triage, Public

Description

Partition: year=2024, month=5, day=10, hour=12

In this hour, webrequest reports multiple countries for the same IP address, and multiple ISPs for the same country.

select * from webrequest
 where year = 2024
   and month = 5
   and day = 10
   and hour = 12
   and ip = ''
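
To narrow a query like the one above to only the conflicting addresses, one option is to group rows by IP and keep the addresses that carry more than one country. The following is a minimal pure-Python sketch of that grouping, not the query used in this task. The sample rows, the `country_code` field name, and the helper `ips_with_multiple_countries` are hypothetical, and the addresses come from the documentation range 192.0.2.0/24 rather than from webrequest.

```python
from collections import defaultdict


def ips_with_multiple_countries(rows):
    """Return {ip: sorted list of countries} for IPs seen with more than one country."""
    countries_by_ip = defaultdict(set)
    for row in rows:
        countries_by_ip[row["ip"]].add(row["country_code"])
    # Keep only the addresses that were geolocated to two or more countries.
    return {ip: sorted(c) for ip, c in countries_by_ip.items() if len(c) > 1}


# Hypothetical sample rows, not real webrequest data.
sample_rows = [
    {"ip": "192.0.2.1", "country_code": "US"},
    {"ip": "192.0.2.1", "country_code": "CA"},
    {"ip": "192.0.2.2", "country_code": "FR"},
    {"ip": "192.0.2.2", "country_code": "FR"},
]
conflicts = ips_with_multiple_countries(sample_rows)
print(conflicts)
```

The same shape applies to ISPs by swapping the grouped field.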

Event Timeline

Hypothesis so far: some workers may be getting MaxMind updates on a staggered schedule relative to others, so at any given time the cluster holds a mix of database versions.

Indeed, different versions of the database seem to be present on cluster hosts.

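#%%
# Assumes an existing SparkSession named `spark` (e.g. a notebook session).
from pyspark.sql import functions as F

@F.udf()
def maxmind() -> str:
    # Runs on the executor: report the build time of that host's local City database.
    import maxminddb
    from datetime import datetime
    reader = maxminddb.open_database('/usr/share/GeoIP/GeoIP2-City.mmdb')
    try:
        ts_epoch = reader.metadata().build_epoch
    finally:
        reader.close()
    # fromtimestamp() renders the epoch in the executor's local timezone.
    return datetime.fromtimestamp(ts_epoch).strftime('%Y-%m-%d %H:%M:%S')


@F.udf()
def host() -> str:
    # Hostname of the executor that evaluated this row.
    import socket
    return socket.gethostname()

#%%
# Spread rows over many partitions so the UDFs run on as many hosts as possible.
out = (spark.range(1, 10000)
    .repartition("id")
    .withColumn("maxmind", maxmind())
    .withColumn("host", host())
).cache()
#%%
out.groupBy("maxmind").agg(F.collect_set("host").alias("hosts")).show(truncate=False)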

Returns

+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|maxmind            |hosts                                                                                                                                                                                                             |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-05-28 15:24:59|[an-worker1142, an-worker1104, an-worker1107, an-worker1139, an-worker1125, an-worker1120, analytics1076, an-worker1165, an-worker1159, an-worker1161, an-worker1118, an-worker1103, an-worker1098, an-worker1126]|
|2024-04-23 12:31:09|[an-worker1132, an-worker1168, an-worker1141, an-worker1144, analytics1071, an-worker1111, an-worker1160, an-worker1102, an-worker1122]                                                                           |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
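
To turn output like this into a list of hosts that lag behind the newest build, the (build, host) pairs can be grouped and compared against the latest timestamp. This is a small sketch under that assumption. The helper name `hosts_behind_latest` is hypothetical, and the rows are a few pairs copied from the table above.

```python
from collections import defaultdict


def hosts_behind_latest(rows):
    """Group hosts by build timestamp; return (latest_build, stale hosts by build)."""
    hosts_by_build = defaultdict(set)
    for build, host in rows:
        hosts_by_build[build].add(host)
    # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically, so max() is the newest build.
    latest = max(hosts_by_build)
    stale = {b: sorted(h) for b, h in hosts_by_build.items() if b != latest}
    return latest, stale


# A few (build, host) pairs copied from the output above.
rows = [
    ("2024-05-28 15:24:59", "an-worker1142"),
    ("2024-05-28 15:24:59", "analytics1076"),
    ("2024-04-23 12:31:09", "an-worker1132"),
    ("2024-04-23 12:31:09", "analytics1071"),
]
latest, stale = hosts_behind_latest(rows)
print(latest, stale)
```

In the notebook, the pairs could presumably be taken from `out.select("maxmind", "host").distinct().collect()`, since each collected row unpacks as a (build, host) tuple.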

Note that this only checks the City database (GeoIP2-City.mmdb), and only on the subset of hosts where Spark happened to run partitions.
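
To cover more than the City database, the same metadata check could be run over every `.mmdb` file in the directory. The sketch below assumes the other MaxMind databases sit next to `GeoIP2-City.mmdb` in `/usr/share/GeoIP`, which this task does not confirm. The helper names `format_build_epoch` and `list_db_builds` are hypothetical, and timestamps are rendered in UTC so that hosts with different local timezones compare cleanly.

```python
import glob
import os
from datetime import datetime, timezone


def format_build_epoch(epoch):
    """Render a build epoch (seconds since 1970-01-01) as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def list_db_builds(directory="/usr/share/GeoIP"):
    """Return {filename: (database_type, build time in UTC)} for each .mmdb file."""
    import maxminddb  # imported lazily: only needed on hosts that have the files

    builds = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.mmdb"))):
        reader = maxminddb.open_database(path)
        try:
            meta = reader.metadata()
            builds[os.path.basename(path)] = (
                meta.database_type,
                format_build_epoch(meta.build_epoch),
            )
        finally:
            reader.close()
    return builds

# On a worker host (assumed layout):
#   for name, (db_type, built) in list_db_builds().items():
#       print(name, db_type, built)
```

Wrapping `list_db_builds` in a UDF alongside `host()` would extend the per-host comparison to all databases, though it is still limited to the hosts Spark schedules work on.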