Taiyo - AI - Data Engineering (Web Scraping) Trial Task (Updated)


TAIYŌAI INC.

Data Engineering Trial Task


Objective:
Find, scrape, standardize, and continuously update data regarding construction and
infrastructure projects and tenders in the state of California.

Part 1: Research and Data Sourcing


Task: Identify 5-10 reliable, official data sources on construction and infrastructure
projects and tenders in California.

Methodology: Research online to identify these sources. Explicitly state your research process.

Part 2: Data Extraction and Standardization


Task: From the provided Table 1 (Suggested Data Sources) and your own list, scrape data in a
structured format using REST APIs or libraries such as lxml, BeautifulSoup, and Selenium. Use
language models (such as the OpenAI API, or open-source models like Mistral 7B and Llama 2) to
convert unstructured data into a structured format. Bonus: Create a chatbot on top of the data.

Requirements:
Demonstrate how you can build data products (DPs) to scrape data from multiple sources in an
efficient and scalable manner. Standardize the scraped data according to the guidelines
provided in Table 2.
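
One possible way to keep scraping scalable across sources is a small registry in which each source only supplies its own parser, while fetching and iteration are shared. The following is a minimal sketch under stated assumptions: the names `Source`, `LinkCollector`, `parse_link_list`, and `run_sources` are hypothetical, the sample HTML and the example.org URL are invented for illustration, and the standard-library `html.parser` stands in for lxml or BeautifulSoup so the block has no third-party dependencies.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, Iterator, List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href and visible text of every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Dict[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:  # skip links with no visible text
                self.links.append({"href": self._href, "title": title})
            self._href = None


def parse_link_list(html: str, base_url: str) -> List[Dict[str, str]]:
    """Hypothetical parser for a page that lists projects as plain links."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [
        {"url": urljoin(base_url, link["href"]), "title": link["title"]}
        for link in collector.links
    ]


@dataclass(frozen=True)
class Source:
    name: str
    list_url: str
    parse: Callable[[str, str], List[Dict[str, str]]]


def run_sources(
    sources: Iterable[Source], fetch: Callable[[str], str]
) -> Iterator[Dict[str, str]]:
    """Runs every registered source through the same fetch-and-parse loop."""
    for source in sources:
        html = fetch(source.list_url)
        for row in source.parse(html, source.list_url):
            yield {"source": source.name, **row}


# Invented sample page; a real run would pass a fetch function that does HTTP.
SAMPLE_HTML = (
    '<ul><li><a href="/p/1">Bridge Repair</a></li>'
    '<li><a href="/p/2">Sewer Upgrade</a></li></ul>'
)
demo_sources = [Source("demo_city", "https://example.org/projects", parse_link_list)]
rows = list(run_sources(demo_sources, fetch=lambda url: SAMPLE_HTML))
```

Because `fetch` is injected, the same loop could be backed by an HTTP client, a Selenium session, or a test stub. Adding a source then means adding one `Source` entry and its parser. Each returned `url` would still need a second pass to scrape the project's detail page.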

Evaluation Criteria
● Scalability: Ability to scrape multiple sources effectively.
● Adherence to Standards: Conformity with the provided data standards; penalties for deviation.
● Automation and Continuity: Quality of the proposal for continuous data updating, including
details on cron monitoring and production environment suitability.
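
For continuous updating, one option is to fingerprint every scraped record and compare it with the fingerprint stored on the previous run, so that each scheduled run only reports new or changed projects. The sketch below is illustrative, not a prescribed design: `fingerprint` and `diff_records` are hypothetical names, the state is an in-memory dict (a production setup would persist it), and it assumes every record carries the `original_id` attribute from Table 2.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple


def fingerprint(record: Dict) -> str:
    """Stable SHA-256 hash of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_records(
    previous: Dict[str, str], scraped: Iterable[Dict]
) -> Tuple[Dict[str, str], Dict[str, List[Dict]]]:
    """Classifies records as new, changed, or unchanged since the last run.

    `previous` maps original_id to the fingerprint stored by the last run.
    Returns the updated state and the classified records.
    """
    state = dict(previous)
    changes: Dict[str, List[Dict]] = {"new": [], "changed": [], "unchanged": []}
    for record in scraped:
        key = record["original_id"]
        digest = fingerprint(record)
        if key not in previous:
            bucket = "new"
        elif previous[key] != digest:
            bucket = "changed"
        else:
            bucket = "unchanged"
        changes[bucket].append(record)
        state[key] = digest
    return state, changes


# Two simulated runs with invented records.
state, first = diff_records({}, [{"original_id": "P-001", "status": "Open"}])
state, second = diff_records(state, [{"original_id": "P-001", "status": "Awarded"}])
```

A scheduler such as cron could call this once per run. A run that scrapes zero records, or far fewer than the previous run, is a reasonable signal to raise in monitoring, since it often means the source page layout has changed.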

Deliverables
Candidates should share a Google Drive folder containing:
1. Python Scripts: Code used for data scraping and standardization.
2. Sample Datasets: Examples of the data extracted and standardized.

Notes to Candidates
● Scrape the full details of each project, not just the list of projects on the source. Use
best practices to design your scrapers. Pay close attention to the data standards and ensure
your methods are scalable and suitable for a production environment.
● Clearly articulate your use of AI or machine learning models, specifically in the context of
data sourcing and any preprocessing tasks.
● Demonstrate a thoughtful approach to continuous data updating and monitoring.

Task Submission: Submit your task using this form:
https://forms.gle/BGNFXQr4VeJed7ug8

Table 1. Suggested Data Sources

• Richmond
o https://www.ci.richmond.ca.us/1404/Major-Projects
• Eureka
o https://www.eurekaca.gov/744/Current-Projects
o https://www.eurekaca.gov/305/Completed-Projects
• Cal-eProcure
o https://caleprocure.ca.gov/pages/Events-BS3/event-search.aspx
• City of Irvine
o https://cityofirvine.maps.arcgis.com/apps/Shortlist/index.html?appid=2d663ab00d0d4eee8cbcd41a1bae0b93

Table 2. Data Standards List

Attribute        Description                                                     Type

Generic
original_id      Unique ID from source for the project                           String
aug_id           Generated by Taiyo (UUID)                                       String
country_name     Name of the country of the project                              String
country_code     ISO 3-letter Country Code                                       String
region_name      Region Name for country according to World Bank                 String
region_code      Region Code for country according to World Bank                 String
latitude         Latitude of the location of the project                         Float
longitude        Longitude of the location of the project                        Float
url              URL of the project on the source                                String
title            Title of the project                                            String
description      Summary/description of the project                              String
status           Status of the project                                           String
timestamp        Date associated with the project                                Date
timestamp_label  Label for the timestamp (published_date, posted_date, etc.)     String
budget           Value/Cost of the project                                       Long
budget_label     Label for the budget (Total Project Cost, Total Funding, etc.)  String
currency         Currency for the budget (ISO 4217 code)                         String
sector           Sector classification of the project (Transport, Health, etc.)  String
subsector        Subsector classification of the project (Road, Hospital, etc.)  String
document_urls    Links for any files/attachments of the project                  String
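
The attributes above can be captured in a single typed record so that every scraper emits the same columns in the same order. This is a sketch under assumptions rather than an official schema: `StandardRecord` and `parse_budget` are hypothetical names, the Table 2 types are mapped to Python as String to `str`, Float to `float`, Long to `int`, and Date to `datetime.date`, the country defaults assume California projects, and the sample values are invented. The region fields are left empty to be filled from the World Bank classification.

```python
import re
import uuid
from dataclasses import asdict, dataclass, field
from datetime import date
from decimal import Decimal
from typing import Dict, Optional

_MULTIPLIERS = {
    "thousand": 10**3, "k": 10**3,
    "million": 10**6, "m": 10**6,
    "billion": 10**9, "b": 10**9,
}
_BUDGET_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion|k|m|b)?\b", re.IGNORECASE
)


def parse_budget(text: str) -> Optional[int]:
    """Converts text such as '$1.2 million' to an integer amount, or None."""
    match = _BUDGET_RE.search(text)
    if match is None:
        return None
    number = Decimal(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    return int(number * _MULTIPLIERS.get(unit, 1))


@dataclass
class StandardRecord:
    """One project, with fields in the order listed in Table 2."""

    original_id: str
    aug_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    country_name: str = "United States"  # assumption: California sources only
    country_code: str = "USA"
    region_name: Optional[str] = None  # fill from World Bank classification
    region_code: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    url: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None
    status: Optional[str] = None
    timestamp: Optional[date] = None
    timestamp_label: Optional[str] = None
    budget: Optional[int] = None
    budget_label: Optional[str] = None
    currency: Optional[str] = None
    sector: Optional[str] = None
    subsector: Optional[str] = None
    document_urls: Optional[str] = None

    def to_row(self) -> Dict[str, object]:
        """Returns a flat dict with the date rendered as an ISO 8601 string."""
        row = asdict(self)
        if self.timestamp is not None:
            row["timestamp"] = self.timestamp.isoformat()
        return row


# Invented sample values, for illustration only.
record = StandardRecord(
    original_id="P-001",
    title="Bridge Repair",
    timestamp=date(2024, 1, 15),
    timestamp_label="published_date",
    budget=parse_budget("$1.2 million"),
    budget_label="Total Project Cost",
    currency="USD",
)
row = record.to_row()
```

Keeping the field order identical to Table 2 means `to_row()` can be written straight to CSV or JSON without a separate column mapping. Missing values stay as `None` rather than being guessed.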
