

HACKATHON

The document discusses the importance of collaboration and resource utilization in tackling time series regression problems, emphasizing the value of sharing ideas among peers. It outlines a framework for data processing and analysis, including data cleaning, univariate and bivariate analysis, and model setup using Python. Additionally, it encourages participants to explore various data science resources and engage with mentors for guidance.


And some useful resources, some possibilities. I think each other is already an important resource: these kinds of problems are really good if you're collaborating, thinking about the problem and chatting to people who are working on it together, because that's where, at least in my experience, you get the best ideas, bouncing them off other people. That's why we've really encouraged working in groups and collaborating. We have society members and Predictive Insights members here if you have any questions or want to chat about how to go about it, and I think the previous winners and Neil might be open to giving advice or sharing their thoughts (he's nodding his head). So I'd really recommend chatting to each other and chatting to us.

Another useful thing: you're faced with this task and have to figure out how to go about this kind of problem (it's a time-series regression problem), and there are so many resources online. Kaggle itself is already a cool platform because there are a lot of competitions, notebooks, and things people have tried before, and you can just go on there and search for, say, "employment prediction" and get a whole bunch of models that people have already tried. Don't trust them fully, but it can be useful to look at what other people have done. And obviously Googling and researching, looking at resources online, is also useful just to learn about these kinds of problems.

OK, so now I just want to walk through the start of the notebook and talk through all the steps that are there, to give you an idea of a good framework for approaching this kind of problem. What I've done here is download the notebook, put it into a folder, and open that folder in Visual Studio Code with Python installed, but you can also work in a normal Jupyter notebook, whatever you're comfortable with. If you're more comfortable with R, feel free to use that, although I think it's probably more useful to work in Python. The main thing to remember is to put the starter notebook you're working on in the same directory as your training CSV, so you need to download the train file: go to the Data tab and find the train file.
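A minimal sketch of that first step, loading the training file with pandas. The file name, columns, and values below are stand-ins written to disk just to make the example self-contained; the real schema comes from the competition's data page.

```python
import pandas as pd

# Hypothetical stand-in for the competition's training file. The real file
# and its columns come from the Kaggle data page, not from here.
csv_text = "Survey_date,Status,Province\n2024-01-15,employed,3\n2024-02-20,unemployed,7\n"
with open("Train.csv", "w") as f:
    f.write(csv_text)

# With Train.csv in the same directory as the notebook, loading is one line:
df_train = pd.read_csv("Train.csv")

print(df_train.shape)           # (rows, columns)
print(list(df_train.columns))   # column names
```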

This shows us the total amount of data we have and the number of columns, and then we've got a list of all the columns here. They're obviously also described on the Kaggle page, but it's nice to see how they appear in the notebook and confirm how they map. The first thing you want to do is go through each column and clean it: remove any NaNs, encode things, and get rid of the nonsense values. That's what this first step is doing, converting strings to datetime variables, filling in the... oh, my laptop just died. OK, this will take two minutes, but for now I'd recommend you all download the starter notebook and start looking at it. Apologies. Does anyone have any questions at the moment?
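The cleaning steps just mentioned (strings to datetime, filling in missing values) might be sketched like this; the column names and the median-fill choice are assumptions for illustration, not the real schema or the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "Survey_date": ["2024-01-15", "2024-02-20", None],
    "Tenure": [120.0, np.nan, 45.0],
})

# Convert string dates to proper datetime values (bad entries become NaT).
df["Survey_date"] = pd.to_datetime(df["Survey_date"], errors="coerce")

# Fill missing numeric values; the median is one common, robust choice.
df["Tenure"] = df["Tenure"].fillna(df["Tenure"].median())
```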

And some of these are useful steps, like converting the status codes to categories instead of numbers, which you can check in the documentation; you just search it up in the PDF. There you'll see that one of the codes means NEET, not in employment, education or training, which is basically unemployment. What this step is doing is mapping from the numerical codes, the way the values are stored in the dataset, to a categorical value. Or it might actually be in the opposite direction... no, it is going from the code to the category. So I'd encourage you to go through each of the columns, beyond just the ones shown here and the ones that I share, and figure out how you think it would be best to change them to make them more accessible to the model.
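A sketch of that kind of code-to-label mapping; the specific codes and labels here are invented, and the real ones live in the data-description PDF.

```python
import pandas as pd

# Hypothetical code-to-label mapping; check the data-description PDF for the
# real codes before reusing anything like this.
status_map = {0: "employed", 1: "unemployed", 2: "studying"}

df = pd.DataFrame({"Status": [0, 1, 1, 2]})

# Map the numeric codes to labels, then mark the column as categorical so
# models and plots treat it as unordered categories rather than numbers.
df["Status"] = df["Status"].map(status_map).astype("category")

print(df["Status"].cat.categories.tolist())
```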
After all of that, the next step is where things start to get a bit more interesting: doing univariate analysis on the different columns, for example seeing how many of the rows fall under each of the categories in the industry column. On the earlier question: these are suggestions. You don't even have to use the starter notebook; as long as your output is in the same format, that's fine. You don't have to do any of these steps. They're suggested steps we've started you off with, but I'd strongly recommend you go through each of them and see how you can edit and improve them. This is a very basic baseline, and the whole idea is to change it and create your own models. I'm going to go over this section pretty briefly because I'm sure a lot of you want to get going, and if you have any questions, feel free to ask. So here, value_counts is useful to show how many of each of the different values there are under each column.
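The value_counts idea in miniature; the column name and values below are made up.

```python
import pandas as pd

# Illustrative only: a tiny column of coded province values.
df = pd.DataFrame({"Province": [3, 7, 3, 3, 9]})

# How many rows fall under each distinct value of the column.
counts = df["Province"].value_counts()
print(counts)
```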
Then I have this tenure graph, which is useful. Up here we calculated tenure as the time between the survey date and the start of the most recent employment or position. So let's say the particular person we're looking at got a job: it's basically the time from when they got the job to when the survey was answered. I think tenure here is in days, and plotting it you can see that it's skewed towards the lowest values, which is explained in the notebook. Then this looks at things like the average age of the people in the data.
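Computing tenure in days, as just described, might look like this; the column names and dates are invented for illustration.

```python
import pandas as pd

# Hypothetical columns: survey date and start date of the most recent position.
df = pd.DataFrame({
    "Survey_date": pd.to_datetime(["2024-03-01", "2024-03-01"]),
    "Job_start":   pd.to_datetime(["2023-12-01", "2024-02-20"]),
})

# Tenure = days between the start of the most recent position and the survey.
df["Tenure_days"] = (df["Survey_date"] - df["Job_start"]).dt.days

print(df["Tenure_days"].tolist())
```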
Then we have some bivariate analysis, which is basically looking at two variables together, over time or over another variable. This is quite an interesting one: you see the counts of employed and unemployed for the different ages, and you can see there's a relationship here. It's these kinds of relationships that you can actually put into your model using different techniques, whether by creating a new feature or by weighting things somehow. An example: you can see that so few of the 18-year-olds are employed, such a small percentage, that it might even be beneficial to predict 0 for all of them in the way you produce your results. Then we look at the population breakdown of the data here, and you can see the counts of employed people by gender and province, which is something you can examine and use. Another example of this kind of thing: you can see that provinces 3, 4, 6, 8 and 9 are similar, in some scenarios almost the same, so you could build separate models for different provinces, for example, depending on your further analysis.
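A bivariate check like the age-versus-employment one can be done with a groupby; the miniature data below is invented.

```python
import pandas as pd

# Made-up miniature data: employment status by age.
df = pd.DataFrame({
    "Age":    [18, 18, 18, 25, 25, 25],
    "Status": ["unemployed", "unemployed", "unemployed",
               "employed", "unemployed", "employed"],
})

# Share of employed respondents within each age group.
rate = (df["Status"] == "employed").groupby(df["Age"]).mean()
print(rate)
```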
So then we start looking at the aggregated, quarter-level values. Something to remember is that what we're trying to predict is for the next six months, the quarter starting from June 2024, but the format of the data we have is just the questionnaires at each of those times, so we need to bring them together and combine the data, and that's what this section is doing. This is another useful graph: it shows the number of employed people over time, split by gender, which is obviously exactly what we're trying to predict, and you can see the trend, which should inform how you put together your model.
Then it's the actual modelling. What we've put into this notebook is an ARIMA model, which is a classic time-series model. That part just sets up the model, and then we've put in a section looking at seasonality and residuals, which are explained in the notebook. There's also a validation section, which is basically just checking your model; I could have explained more about validation, but this is template code that you can change and work with.
We calculate the MAE based on our validation, and then the final step is to run the prediction again on the input data you need to provide. So in the validation section you're predicting dates that you already know the answers to (obviously the training for that does not include those dates), while this test step is getting ready for the actual submission, and that's all done here. For the final submission we just set it up as a DataFrame, save it as a CSV, and then you upload that CSV to submit your prediction. So yeah, I think that's basically it for the starter notebook.
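The validation error and submission steps might look roughly like this; the IDs, column names, and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical held-out values vs. model predictions for the validation window.
y_true = np.array([120.0, 125.0, 130.0])
y_pred = np.array([118.0, 128.0, 129.0])

# Mean absolute error over the validation period.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)

# Final step: wrap the test-set predictions in a DataFrame and save as CSV,
# which is the file you then upload to the platform.
submission = pd.DataFrame({"ID": [1, 2, 3], "Target": y_pred})
submission.to_csv("submission.csv", index=False)
```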
Does anyone have any questions on the starter notebook at the moment? I kind of just ran through it, but you all seem to be good. I've also put up some resources here, a list of different data science things you can read about if you're having trouble with ideas; I'll find a way to make these accessible to you over the course of today. So that's basically it. I just want to say thank you so much for coming, and all the best. If you're still looking for a group, speak to us; you don't have to be in a group, but we strongly encourage it. Thank you so much and all the best.

Thanks Adrian. The run of play now is: get started. We'll be around answering questions and so on. I think the pizzas arrive at 12 or somewhere around there, so it's now your opportunity to just get down to it and work on the problem.

For people within the society, notifications will come through; for example, there's an online Q&A session.

Economics or something would be better for that. So essentially you have your models, and you have the original data that you start with, which then gets processed. What we did was have one set of processed data, and that was fed into all the models, but I think what would be cool to experiment with is cleaning data specifically for specific models, because I know some models don't like certain data, especially the more traditional statistical models versus the machine-learning ones. In general you want the data processing done first, and then you can start feeding it to the models, and it's good to keep those steps separate, because a lot of things change if you change something on the model side; they impact each other in many ways.

The other thing I'd say is that most of the models, the stuff you can find that people have already done, are really good. So the model will be important, but the difference between good models and great models is noticeable yet not as big as the difference between someone who does really good feature engineering versus average feature engineering. As was mentioned earlier, the first thing you want to do is get an intuition for the data, and I hope most of you have done this at this point: generate graphs of every single variable. You want to see every potential relationship, and just have a look, use your eyes, because there's a lot you can spot from the data without even typing a line of code.
As far as one-hot encoding is concerned, do you all know what one-hot encoding is? OK, so essentially, with a lot of these columns (I only checked out this year's data today, but I see, for example, that the provinces are coded numerically), one case will be 1, another will be 2, and so on, and a lot of models will then order them and think that 1 is somehow "better" than 2, whereas you don't actually know that there's an ordered difference between them. So I'd suggest taking those numeric columns and converting them, either through one-hot encoding or maybe into categoricals. Some models don't like certain types of data, so it depends on the situation.
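A one-hot sketch with pandas, assuming a numerically coded Province column like the one described; the values are invented.

```python
import pandas as pd

# Province codes are arbitrary labels, not magnitudes, so one-hot encode them
# rather than letting a model treat 9 as "bigger" than 1.
df = pd.DataFrame({"Province": [1, 2, 2, 9]})

encoded = pd.get_dummies(df, columns=["Province"], prefix="Province")
print(encoded.columns.tolist())
```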
This is where the fun part comes in, though, because with this type of feature engineering you can, as I said earlier, get a big advantage over people. Last year the dataset was different: we had geographic data about people (their province, school, council) and then their matric marks and so on, which we could use to impute and work things out. This is where you just have to use a lot of common sense. For example, we worked out that passing maths literacy was actually worse than failing maths core, which may not be obvious, but after looking at the data you can see that you'd rather have someone who just failed maths core than someone who passed maths lit, so we actually created a feature for that. Another thing we noticed is gender discrimination; it's a little more interesting than the obvious fact that women typically don't fare well in the job market for a lot of reasons. Where it happens is really interesting: we noticed that high-achieving women who come out with really good matric marks are actually more likely to get a job, whereas most other women have a much harder time getting into the job market, even with marks comparable to a lot of the men coming through. I thought that was quite interesting. Another thing (and I don't know if it's in this year's data) is the tenure data: tenure is basically how long you've been in your current position, whether employed or unemployed. So it's really important to combine it with whether someone is employed or not, because if you're employed and you've been employed for three years, that's a very good thing, but if you're unemployed and you've been unemployed for three years, that's a very bad thing. That sort of feature engineering is also really useful.
The other thing I'd say is don't be scared to cull data. I know you'd think having the most data possible is the best scenario, but data often conflicts. I don't know if you know what multicollinearity is, but essentially features sometimes conflict with each other because they're pointing at similar things; there's cross-contamination in the data. There are a lot of ways you can test for it. We used elastic net, which has a built-in way of dropping features for that, but you could also just go through and do a VIF analysis and see which features are strongly related to each other. There's also target encoding, and things like a date feature: we worked out that at some of the dates when the survey was done, the economy was simply in a worse position, so a lot more people were going to be retrenched or just looking for work, and we worked that out just from grouping the data and looking at it. We also saw, unfortunately, province effects: in some provinces you're just far less likely to get a job because of the economic situation there. If you check our source code there's a full list of all the changes that we made. I think that's about it; anything else you remember?
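A VIF check along those lines might look like this, using statsmodels; the three synthetic columns are constructed so that two are nearly copies of each other, which is exactly the kind of cross-contamination being described.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                      # independent column
})

# VIF for each column: how well the other columns predict it. Large values
# flag collinear features that are candidates for dropping.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```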
Oh yes, with the model, as I said, we used elastic net to work out which features were too closely correlated and dropped them; it's similar to lasso or ridge. For cross-validation (I think it's different this year) we used stratified 5-fold and measured the area under the curve last time, and I don't think that will work as well now because it's a time series, but cross-validation is important: you can't just rely on leaderboard submissions, and once you get it working, it works well enough. Another thing is to be prepared to test a lot of models. We went through tons of them and just saw what works; there's a whole list, just check the GitHub if you want to see. What we found is that you want models that pick up different features in the data to be grouped together; combined, they're really powerful. We worked with a lot of the boosted machine-learning models, and they generated a lot of false positives, so they were very optimistic about whether someone was employed or not. It worked really well to combine those optimistic models, which get really high scores, with some of the apparently worse-performing models, just because those were probably pessimistic. Yes, I'll get to that question now.
Another thing: some of these models give slightly different outputs if you rerun them multiple times; they're not deterministic. So just check, because sometimes you need to run them several times to get the best results out of them. For combining models we tried a bunch of strategies. We tried a voting algorithm where the big optimistic boosted algorithms got more weighting and put down a vote on what the final target value should be, and the other models all got votes too, so that if the majority voted the other way it would override the main models, but that actually didn't perform too well. We ended up using a stacking algorithm, which basically combines them, something like a geometric mean of all of them, into the final result. All the base learners made their own predictions, those predictions fed into the final model, which in our case was an MLP, and that made the final decision for each prediction. I don't want to keep you too much longer, but we also found that if you take your top submissions and take a mean of them, that sometimes gives a bit of extra value. You don't always know why, but that's another thing we did: we took our top five submissions that we had checked against the leaderboard, made another prediction, took the geometric mean of all of those predictions, and resubmitted that, and it got a better result. I think it's because as we were making submissions the model was evolving, so the results were different every time, and the blend had an error-correcting capability; that's how we got a higher score.
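Blending top submissions by geometric mean, as described, can be sketched like this; the prediction values are invented.

```python
import numpy as np

# Hypothetical: predictions from five top submissions, one row per submission.
preds = np.array([
    [0.80, 0.10, 0.55],
    [0.78, 0.12, 0.60],
    [0.82, 0.09, 0.50],
    [0.79, 0.11, 0.58],
    [0.81, 0.10, 0.52],
])

# Geometric mean across submissions: one blended value per prediction.
blend = np.exp(np.log(preds).mean(axis=0))
print(blend)
```

The geometric mean damps outliers more than the arithmetic mean, which is one reason blends like this can error-correct across slightly different model runs.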
Any questions? Yeah, on that: if you look here, we noticed precision is a good indicator for overfitting, not AUC-ROC, which is just the area under a curve. I think it's because precision regularly gets within the ballpark of what it needs to do, but basically it's because the model is fitting directly to the data. So if your precision is really high and your accuracy is somewhat lower, it's probably because you're overfitting. We could see whether we were overfitting when we were using stratified K-fold: we split the dataset into five separate blocks, each with a fair representation of the data, and then the models would be trained on four of those blocks and predict on the fifth, for each combination. Then you can look at the AUC-ROC across all of those folds and see if it's overfitting or not: if it were overfitting, the other folds would show a lower AUC-ROC. So you just have to try it and see whether you're overfitting.
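The stratified K-fold scheme just described can be sketched with scikit-learn; the toy labels below stand in for the imbalanced employment target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced binary target: 40 negatives, 10 positives.
y = np.array([0] * 40 + [1] * 10)
X = np.arange(50).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation block keeps the 4:1 class ratio, so every fold gives a
    # fair representation of the data (here: exactly 2 positives per fold).
    assert y[val_idx].sum() == 2
```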
Any other questions? Yeah, how was the internship? So I have to say, working at Predictive Insights is really cool. They often say that university is nothing like the workplace, that it's totally different, but the way I'd describe it is that it felt like a really cool, unique project; that's what I enjoyed about it. You're given a lot of space to do your own problem solving and reach conclusions yourself, which is really nice, the work hours are flexible, and you can work in whatever way works for you. The projects you work on are really interesting. If anyone wants to come go through the code with us, we can show it off.

Thanks Jonathan; please chat to them for some tips and tricks afterwards. We'll give you the second half of the rugby to continue working, and then after that we'll have the draw and close things out. Thanks.
