-------------------------  METAREVIEW  ------------------------
The paper proposes a novel and practically relevant two-stage method for predicting live chat intent from browsing history. It uses Longformer+ for initial intent classification and GPT-3.5 for generating fine-grained intents. Reviewers appreciated the innovative approach, practical relevance, and clear presentation. However, they raised concerns about the limited intent classes, lack of open-source code, automation feasibility, and ethical considerations related to user privacy. Despite these issues, the method's strong performance and potential applications in improving customer service justify acceptance. The overall recommendation is to accept the paper.


----------------------- REVIEW 1 ---------------------
SUBMISSION: 2710
TITLE: Forecasting Live Chat Intent from Browsing History
AUTHORS: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi and Julian Mcauley

----------- Overall recommendation -----------
SCORE: 1 (weak accept)
----------- Relevance to CIKM -----------
SCORE: 4 (good)
----------- Originality of the Work -----------
SCORE: 4 (good)
----------- Technical Soundness -----------
SCORE: 3 (fair)
----------- Quality of Presentation -----------
SCORE: 4 (good)
----------- Impact of Ideas or Results -----------
SCORE: 3 (fair)
----------- Paper Summary -----------
A two-stage method is proposed to predict live chat intent from a user's browsing history. In the first stage, a fine-tuned classifier (Longformer+) is employed to infer user intent based on the browsing history. In the second stage, both the browsing history and the predicted intent are input into GPT-3.5 to generate the final candidate intents. The results indicate that the classification stage is crucial, as the two-stage method significantly outperforms direct intent generation without the initial classification.
----------- Main Strengths (at least 3) -----------
1- A two-stage method is proposed to explore a new, interesting scenario of generating candidate user intents from user browsing history.

2- Various models were tested for the first stage of the process, with Longformer+ ultimately being selected as the best-performing model.

3- All four figures in the paper were very helpful for understanding both the problem and its solution.

4- The efficacy of the method is evaluated both with and without the intent classification stage to better understand its impact.
----------- Main Weaknesses (at least 3) -----------
1- The number of intent classes is limited to five (INS, AVL, PRI, WTY, RET). Although the paper mentions that these five intents cover the majority of cases, I find this classification insufficient. Expanding the number of classes could provide a more comprehensive coverage of potential intents.

2- Because the source code has not been released, the results cannot be reproduced.

3- In the paper, it is mentioned that the browsing history is converted into a sequence of key-value pairs before being fed to the intent classifier. However, I am curious whether this conversion process is fully automated. If this conversion is not fully automated, it may not be feasible to apply this method in real-life scenarios.

4- The following instruction is used for GPT-3.5 in the intent classification task: "Predict the customer's intent behind reaching out to a live chat agent after viewing a sequence of the following pages." This instruction does not list the five intent classes (INS, AVL, PRI, WTY, RET). Including these classes in the prompt should improve the performance of GPT-3.5 on the intent classification task, and providing few-shot examples in the prompt should improve it further. I was wondering why the authors did not compare such prompted GPT-3.5 variants against their Longformer+ classifier.
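To make the suggestion in point 4 concrete, a prompt along the following lines could be tried. This is only a sketch of what I have in mind, not the authors' setup: the function name `build_prompt`, the "Pages:"/"Intent:" layout, and the example history strings are my own assumptions, and only the base instruction and the five class labels come from the paper.

```python
from typing import List, Tuple

# The five class labels as abbreviated in the paper.
INTENT_CLASSES = ["INS", "AVL", "PRI", "WTY", "RET"]

# Instruction quoted from the paper.
BASE_INSTRUCTION = (
    "Predict the customer's intent behind reaching out to a live chat agent "
    "after viewing a sequence of the following pages."
)


def build_prompt(history: str, examples: List[Tuple[str, str]]) -> str:
    """Build a classification prompt that lists the classes and few-shot examples.

    `history` is the serialized browsing history to classify; `examples` holds
    (serialized history, class label) pairs. The layout is hypothetical.
    """
    lines = [
        BASE_INSTRUCTION,
        "Answer with exactly one of: " + ", ".join(INTENT_CLASSES) + ".",
    ]
    for example_history, example_label in examples:
        if example_label not in INTENT_CLASSES:
            raise ValueError(f"unknown intent class: {example_label}")
        lines += ["", "Pages: " + example_history, "Intent: " + example_label]
    # The query comes last, with the label left open for the model to fill in.
    lines += ["", "Pages: " + history, "Intent:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder histories, invented purely for illustration.
    print(build_prompt("page_type=product", [("page_type=cart", "PRI")]))
```

Comparing GPT-3.5 with this kind of prompt against Longformer+ would show whether the gap reported in the paper is due to fine-tuning or simply to the missing label set.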
----------- Summary to support your recommendation -----------
Strengths of the paper:
The paper proposes a novel two-stage method for generating candidate user intents from browsing history, thoroughly exploring this innovative scenario. Various models were tested, with Longformer+ emerging as the top performer for the initial stage. The inclusion of four clear and helpful figures enhances the understanding of both the problem and its solution. Additionally, the method's efficacy was evaluated with and without the intent classification stage, offering valuable insights into its overall impact.

Weaknesses of the paper:
The classification system is limited to five intent classes (INS, AVL, PRI, WTY, RET), which may not encompass all potential intents. Additionally, the lack of open-source code prevents the results from being reproduced. The conversion of browsing history into key-value pairs for the intent classifier raises concerns about automation; if it is not fully automated, the method may be impractical for real-world applications. Furthermore, the GPT-3.5 instruction for the intent classification task omits both the five intent classes and few-shot examples.


----------------------- REVIEW 2 ---------------------
SUBMISSION: 2710
TITLE: Forecasting Live Chat Intent from Browsing History
AUTHORS: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi and Julian Mcauley

----------- Overall recommendation -----------
SCORE: 0 (borderline paper)
----------- Relevance to CIKM -----------
SCORE: 4 (good)
----------- Originality of the Work -----------
SCORE: 3 (fair)
----------- Technical Soundness -----------
SCORE: 3 (fair)
----------- Quality of Presentation -----------
SCORE: 3 (fair)
----------- Impact of Ideas or Results -----------
SCORE: 3 (fair)
----------- Paper Summary -----------
The paper addresses the problem of predicting user intent from browsing history in online business platforms, specifically for live chat interactions. The proposed two-stage approach first classifies browsing history into high-level intent categories using fine-tuned Transformers, and then generates fine-grained intents using a large language model (LLM). The evaluation uses a separate LLM to judge the similarity between generated and ground-truth intents, showing significant performance improvements over direct intent generation.
----------- Main Strengths (at least 3) -----------
1. Collaboration with a major online retailer adds practical relevance and real-world applicability to the research.
2. The predicted intents can improve routing of users to specialized agents and prioritize urgent queries, enhancing customer service efficiency.
3. The problem formulation is precise, with a well-defined goal and methodology.
----------- Main Weaknesses (at least 3) -----------
1. The placement of related work at the end of the paper disrupts the logical flow. Including it earlier would provide better context for understanding the research contributions.
2. The rationale behind removing observations with fewer than five pages in the browsing history needs further explanation to justify its impact on the study’s findings and ensure the robustness of the study.
3. The study relies heavily on automated methods for evaluation. Incorporating more extensive human evaluations could strengthen the validity of the results by providing a more nuanced understanding of the model's performance.
4. The detailed tracking of user browsing history raises potential privacy concerns. Addressing these concerns with appropriate measures and discussing their implications would enhance the ethical considerations of the research.
----------- Summary to support your recommendation -----------
The paper addresses the problem of predicting user intent from browsing history in online business platforms, specifically for live chat interactions. The proposed two-stage approach first classifies browsing history into high-level intent categories using fine-tuned Transformers, and then generates fine-grained intents using a large language model (LLM). The evaluation uses a separate LLM to judge the similarity between generated and ground-truth intents, showing significant performance improvements over direct intent generation.
Strengths:
1. Collaboration with a major online retailer adds practical relevance and real-world applicability to the research.
2. The predicted intents can improve routing of users to specialized agents and prioritize urgent queries, enhancing customer service efficiency.
3. The problem formulation is precise, with a well-defined goal and methodology.
Weaknesses:
1. The placement of related work at the end of the paper disrupts the logical flow. Including it earlier would provide better context for understanding the research contributions.
2. The rationale behind removing observations with fewer than five pages in the browsing history needs further explanation to justify its impact on the study’s findings and ensure the robustness of the study.
3. The study relies heavily on automated methods for evaluation. Incorporating more extensive human evaluations could strengthen the validity of the results by providing a more nuanced understanding of the model's performance.
4. The detailed tracking of user browsing history raises potential privacy concerns. Addressing these concerns with appropriate measures and discussing their implications would enhance the ethical considerations of the research.


----------------------- REVIEW 3 ---------------------
SUBMISSION: 2710
TITLE: Forecasting Live Chat Intent from Browsing History
AUTHORS: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi and Julian Mcauley

----------- Overall recommendation -----------
SCORE: 2 (accept)
----------- Relevance to CIKM -----------
SCORE: 5 (excellent)
----------- Originality of the Work -----------
SCORE: 3 (fair)
----------- Technical Soundness -----------
SCORE: 4 (good)
----------- Quality of Presentation -----------
SCORE: 4 (good)
----------- Impact of Ideas or Results -----------
SCORE: 4 (good)
----------- Paper Summary -----------
Summary

The authors propose to use customers’ browsing history to predict intent, that is, why the customer needs a live chat agent. The prediction is not one of a set of classes but free-form text describing the intent.

The approach has two stages:
1. Intent Classification: Classify a user’s browsing history into five high-level intent categories. The browsing history is represented as a text sequence of page attributes, and a variant of Longformer is used for text classification.
2. Intent Generation: Provide a large language model (LLM) with the browsing history and the predicted intent class from stage 1 to generate fine-grained intents (in the paper, fine-grained intents are raw utterances). Instructed GPT-3.5 is used for generation.
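
As I understand it, the two stages fit together roughly as in the sketch below. This is my own illustration under stated assumptions, not the authors' code: the attribute names, the `key: value` separator format, the prompt wording, and the stand-in `classify` callable (which would be the fine-tuned Longformer variant in the paper) are all hypothetical.

```python
from typing import Callable, Dict, List

Page = Dict[str, str]  # one visited page as attribute -> value (names are assumed)


def serialize_history(pages: List[Page]) -> str:
    """Flatten a browsing history into one text sequence of page attributes."""
    return " | ".join(
        " ".join(f"{key}: {value}" for key, value in page.items())
        for page in pages
    )


def build_generation_prompt(pages: List[Page], classify: Callable[[str], str]) -> str:
    """Stage 1: classify the serialized history. Stage 2: build the LLM prompt."""
    history_text = serialize_history(pages)
    intent_class = classify(history_text)  # stand-in for the fine-tuned classifier
    return (
        f"Browsing history: {history_text}\n"
        f"Predicted intent class: {intent_class}\n"
        "Write the customer's likely first message to the live chat agent."
    )


if __name__ == "__main__":
    # Placeholder pages and a dummy classifier, invented for illustration only.
    pages = [{"type": "product", "title": "cordless drill"}, {"type": "cart"}]
    print(build_generation_prompt(pages, classify=lambda text: "RET"))
```

The returned prompt would then be sent to the generation model; that call is left out here because the paper's exact request settings are not known to me.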

The paper uses automatic evaluation with a separate LLM: few-shot prompted GPT-4 measures the semantic similarity between generated and ground-truth intents.
----------- Main Strengths (at least 3) -----------
1. The paper demonstrates that the proposed method beats competitive baselines, and includes ablation studies.
2. GPT-4’s judgements are validated by having human workers assess 200 intent pairs.
3. The proposed method has many downstream applications in improving conversational e-commerce search.
----------- Main Weaknesses (at least 3) -----------
1. It would be interesting to see an error analysis. The results mention errors in extracting item IDs. Are there other common types of errors?
2. It might be helpful to include metrics for how GPT-4 does on the actual prediction task.
----------- Summary to support your recommendation -----------
The paper makes a solid contribution and is technically sound. It should be accepted.


----------------------- REVIEW 4 ---------------------
SUBMISSION: 2710
TITLE: Forecasting Live Chat Intent from Browsing History
AUTHORS: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi and Julian Mcauley

----------- Overall recommendation -----------
SCORE: 0 (borderline paper)
----------- Relevance to CIKM -----------
SCORE: 4 (good)
----------- Originality of the Work -----------
SCORE: 3 (fair)
----------- Technical Soundness -----------
SCORE: 4 (good)
----------- Quality of Presentation -----------
SCORE: 4 (good)
----------- Impact of Ideas or Results -----------
SCORE: 3 (fair)
----------- Paper Summary -----------
This paper presents a method for predicting a user's intent from browsing history. The authors represent the browsing history as a long sequence, fine-tune a Longformer model to predict the coarse-grained intent (broad categories), and generate the fine-grained intents. An LLM is then used to quantify the similarity between the generated fine-grained intents and the ground truth.

Experiments were conducted on an e-commerce dataset. The authors compared the proposed approach with language models on coarse-grained intent prediction. Experimental results show that the proposed approach improves performance.
Overall, the paper is well-written and easy to follow.
----------- Main Strengths (at least 3) -----------
1. Fine-tuned a pre-trained Longformer to predict the coarse-grained intents using browsing history
2. Prompted GPT-3.5-turbo to generate fine-grained intents
3. Leveraged GPT-4 for evaluating the generated intents
----------- Main Weaknesses (at least 3) -----------
1. The proposed approach is dependent on the browsing history. What if the browsing history is unavailable?
2. The comparison covers some language models but excludes related work on "Predicting intents based on browsing history".
3. Evaluation using GPT-4 is not novel.
----------- Summary to support your recommendation -----------
- This paper presented an interesting approach for predicting the broad intent categories using the browsing history.
- The paper uses a generative model to obtain fine-grained intents and evaluates them with GPT-4.