CN105095070A - Method and system for obtaining QQ group data based on a browser test component
- Publication number: CN105095070A (application CN201510363954.3A)
- Authority: CN (China)
- Prior art keywords: group, browser, testing assembly, page, data
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and system for acquiring QQ group data based on a browser test component. User behavior is simulated through the browser test component in order to perform operations such as logging in to web QQ. Combined with a parallelized design, the method acquires both instant chat messages and non-instant information from QQ groups. The invention avoids manual login operations, acquires data quickly while preserving its completeness, costs less to maintain, and can be quickly modified and put back into use when the version changes.
Description
Technical field
The invention relates to the field of social network data collection. It provides a method and system for acquiring QQ group data based on a browser test component, so that QQ group data can be acquired in a more targeted and accurate way.
Background
Tencent QQ is currently the most influential social network in China. The number of registered QQ users has grown from 2 in 1999 (Ma Huateng and Zhang Zhidong) to hundreds of millions today. In August 2013 the number of active QQ users reached 818.5 million, and at 21:11 on April 11, 2014 the number of simultaneous online users exceeded 200 million. In a QQ group, users can post text and picture messages, upload album pictures, upload files, organize activities and votes, and so on. QQ has become Tencent's flagship product. The development and popularity of instant messaging (IM) are closely tied to its characteristics: IM is real-time, online, and text-interactive, which meets people's needs for communication and collaboration in daily life and at work.
There are currently two main methods for acquiring information from QQ:
(1) Database cracking, that is, cracking the embedded database that the instant messaging software reads and writes while it runs. For collecting QQ group information this is technically hard, because the local database of the QQ client is closely tied to the QQ version. The database file is usually encrypted with MD5, or both the data and the database file are encrypted, so cracking is very difficult and its cost cannot be controlled.
(2) Simulating the web QQ communication process, that is, issuing simulated HTTP GET and POST requests against instant messaging software that offers web access in order to obtain data. Although QQ offers web access (for example SmartQQ), this approach is laborious and tedious for collecting QQ data. Every command sent to QQ must be captured and analyzed to obtain the corresponding GET or POST address and content. Moreover, the communication is essentially asynchronous: a single data request often depends on several other requests, and the next request often needs the result of the previous one as input. Analyzing this process is complex, and it is hard to cover every returned result. Whenever the WebQQ protocol version is upgraded, the collection system may need large-scale changes as well, so the later workload is heavy and maintenance is difficult.
Because the QQ client protocol is proprietary, the web QQ protocol changes frequently, and the open QQ API is limited, these approaches are not suitable for large-scale collection.
A browser test component (browser automation) is a testing tool built on the browser kernel that can be used without special setup. It was originally aimed at web applications, simulating user behavior to run compatibility tests between web applications and browsers, functional tests, stress tests, and so on. In data acquisition based on a browser test component, a high-level program drives the browser's test interface to browse web pages automatically. After the browser finishes parsing JavaScript and rendering the page, the data loaded via AJAX can be obtained by locating and parsing the page's DOM elements. This approach can obtain all of the social network data and avoids the restrictions of the official API.
In summary, existing data acquisition techniques are hard to generalize because platforms differ. Data acquisition for some social networks also depends heavily on the version, is inflexible and not robust, and in the long run demands a great deal of maintenance and upgrade work.
Summary of the invention
Based on the above analysis, and in order to collect QQ group information and messages, the invention proposes a method and system for acquiring QQ group data based on a browser test component.
The invention mainly covers two aspects:
(1) Acquiring the profile information and instant chat messages of QQ groups through a browser test component.
(2) Parallelized acquisition of QQ group messages based on the browser test component.
Specifically, the invention includes the following:
A method for acquiring QQ group data based on a browser test component, comprising the following steps:
1) On the client, start a browser through a port of the browser test component and navigate to the web QQ page;
2) Simulate user behavior to log in with the QQ collection account whose group data is to be acquired;
3) Monitor the DOM tree of the page where the QQ collection account is logged in, continuously obtain instant and non-instant information, and store it in a MySQL database.
Further, the browser test component is the core component: it is initialized before the browser is started and injects a JavaScript library into the browser, which is used to execute the command requests sent by the client.
In the invention we introduce a headless browser. Its advantage is that it includes the browser kernel but has no UI, so skipping graphical rendering and loading reduces the use of computer resources. The invention also supports browsers with a UI (such as Chrome and Firefox). With the asynchronous loading techniques in common use today, good performance can also be achieved with such browsers.
Further, step 1) includes: starting multiple browsers through different ports of the Hub component in the browser test component. The client starts one thread per browser (implemented, as needed, by writing a program that calls the browser test component and handles data processing and similar functions). In other words, multiple browsers are controlled through multithreading, and each thread sends command requests to the browser it started. This process enables parallelized acquisition of QQ group messages based on the browser test component.
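The one-thread-per-browser arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: `FakeBrowserSession`, `worker`, and `run_parallel` are hypothetical names, and the fake session merely records commands instead of driving a real browser through the Hub.

```python
import threading
from queue import Queue


class FakeBrowserSession:
    """Hypothetical stand-in for one browser started through its own Hub port."""

    def __init__(self, port):
        self.port = port
        self.log = []  # commands this browser has received, in order

    def execute(self, command):
        self.log.append(command)
        return f"port {self.port}: {command}"


def worker(session, commands, results):
    # Each thread talks only to the browser it was given.
    for command in commands:
        results.put((session.port, session.execute(command)))


def run_parallel(ports, commands):
    """Start one session and one thread per port, then collect all results."""
    results = Queue()
    sessions = [FakeBrowserSession(port) for port in ports]
    threads = [
        threading.Thread(target=worker, args=(session, commands, results))
        for session in sessions
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sessions, collected
```

Because each thread owns exactly one session, the per-browser command order is preserved even though the threads interleave.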
Further, the user behavior in step 2) includes operations such as clicking, typing, and dragging.
Further, step 2) includes: entering the account number and password of the QQ collection account at login; locating the captcha element node on the page to determine whether a captcha must be entered manually and, if so, prompting the operator to enter it; and, once these steps are complete, clicking the login button to log in.
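The login flow with an optional manual captcha can be sketched as below. `FakeLoginPage` and the element names `account`, `password`, and `captcha` are assumptions made for illustration; a real implementation would locate these nodes in the live page through the browser test component.

```python
class FakeLoginPage:
    """Hypothetical login page; the captcha field exists only when required."""

    def __init__(self, captcha_required):
        self.elements = {"account": "", "password": ""}
        if captcha_required:
            self.elements["captcha"] = ""
        self.submitted = False

    def has_element(self, name):
        return name in self.elements

    def type_into(self, name, text):
        self.elements[name] = text

    def click_login(self):
        self.submitted = True


def login(page, account, password, ask_operator):
    """Fill in the credentials, ask a human for the captcha only if needed, then submit."""
    page.type_into("account", account)
    page.type_into("password", password)
    if page.has_element("captcha"):
        # The captcha is not solved automatically; the operator is prompted.
        page.type_into("captcha", ask_operator())
    page.click_login()
    return page.submitted
```

Passing `ask_operator` as a callable keeps the human prompt out of the automation logic, so it could be `input` in a console tool or a stub in a test.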
Further, in step 3) the client calls the interfaces provided by the browser test component and sends the parameters that the component specifies. The browser test component receives the parameters, parses the client request, and then sends a JS command through HttpProxy telling the JavaScript library to perform the corresponding operation and monitor the DOM tree of the browser page.
Further, in step 3), obtaining instant information includes the following steps:
1-1) After the login redirect succeeds, wait for the page elements to finish loading, then simulate a user click to enter the QQ group interface.
1-2) Allocate a time slice to each group through a task scheduling strategy (the strategy can be set as needed), poll the groups according to the strategy, and monitor and obtain new messages.
1-3) Repeat step 1-2) until all groups have been polled.
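The polling loop in steps 1-2) and 1-3) can be sketched as follows. The function name `poll_once`, the callable `fetch_messages`, and the assumption that each group's visible message list only grows by appending are illustrative choices, not details given in the source.

```python
def poll_once(groups, fetch_messages, seen_counts):
    """Visit every group once and return the messages not collected before.

    groups:         group names in polling order
    fetch_messages: callable returning the (sender, text) pairs currently
                    visible for a group (hypothetical page-reading helper)
    seen_counts:    dict mapping group name to the number of messages
                    already collected; updated in place
    """
    new_messages = []
    for group in groups:
        visible = fetch_messages(group)
        start = seen_counts.get(group, 0)
        # Assumption: the visible list is append-only, so everything past
        # the previously seen count is new.
        for sender, text in visible[start:]:
            new_messages.append((group, sender, text))
        seen_counts[group] = len(visible)
    return new_messages
```

Calling `poll_once` repeatedly with the same `seen_counts` dictionary gives the continuous monitoring behavior: each pass returns only what appeared since the previous pass.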
Further, in step 3), obtaining non-instant information includes the following steps:
2-1) After the login redirect succeeds, wait for the page elements to finish loading, obtain from the page the list of groups that the current account has joined, and load all groups completely by simulating user clicks.
2-2) Poll according to the group list, entering the group member list page first. On this page, the group announcement and the group member list can be obtained by locating page elements. Check whether the group announcement has been updated: if so, write it to the database; if not, do not write. Check for changes in the group member list: if a new member has joined, write the member to the database; if a member has left the group, update that member's status in the database.
2-3) Go to the group shared files page, obtain the shared file information by locating page elements, and check for updates: if there are any, write them to the database; otherwise skip.
2-4) Go to the group album page, obtain the group album information by locating page elements, and check for updates: if there are any, write them to the database; otherwise skip.
2-5) If the group list has not been fully polled, return to step 2-2). If it has, close the browser.
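The member-list comparison in step 2-2) can be sketched as below. Note the substitution: the source stores data in MySQL, while this sketch uses Python's built-in `sqlite3` purely so that it runs without a database server. The table layout and the names `sync_members`, `active`, and `left` are assumptions.

```python
import sqlite3


def sync_members(conn, group_id, current_members):
    """Write newly joined members and mark departed ones; return (joined, left)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members ("
        "group_id TEXT, member_id TEXT, status TEXT, "
        "PRIMARY KEY (group_id, member_id))"
    )
    rows = conn.execute(
        "SELECT member_id FROM members WHERE group_id = ? AND status = 'active'",
        (group_id,),
    ).fetchall()
    known = {row[0] for row in rows}
    current = set(current_members)

    joined = current - known  # on the page but not active in the database
    left = known - current    # active in the database but gone from the page

    for member in joined:
        conn.execute(
            "INSERT OR REPLACE INTO members VALUES (?, ?, 'active')",
            (group_id, member),
        )
    for member in left:
        conn.execute(
            "UPDATE members SET status = 'left' "
            "WHERE group_id = ? AND member_id = ?",
            (group_id, member),
        )
    conn.commit()
    return sorted(joined), sorted(left)
```

Departed members are kept with a changed status rather than deleted, which matches the step's wording about modifying the member's status in the database.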
Further, the page elements in steps 2-2), 2-3), and 2-4) are located using XPath or CSS selectors. XPath is a language for addressing elements in an XML document, and the DOM tree is a tree structure of XML, so XPath offers more complete node-locating capability. XPath is therefore preferred for locating page elements.
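As a rough illustration of XPath-based element location, the sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree` on a made-up, well-formed page fragment. The markup, class names, and the `data-uin` attribute are hypothetical. In the described system the expressions would be evaluated by the browser against the live DOM instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a group member list page.
PAGE = """
<html><body>
  <div id="notice"><p class="announcement">Meeting at 8 pm</p></div>
  <ul id="members">
    <li class="member" data-uin="10001">Alice</li>
    <li class="member" data-uin="10002">Bob</li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Locate the group announcement by its class attribute.
announcement = root.find(".//p[@class='announcement']").text

# Locate every member entry under the member list.
members = [
    (li.get("data-uin"), li.text)
    for li in root.findall(".//ul[@id='members']/li[@class='member']")
]
```

The same two expressions would break if the page structure changed, which is the dependency on page structure that the description acknowledges.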
A system for acquiring QQ group data based on a browser test component, comprising:
a browser test component, used to start a browser through a port and simulate a user logging in to web QQ;
a data collection module, used to obtain instant and non-instant information by monitoring the page elements of the logged-in QQ page;
a data storage module, used to store the acquired QQ group data.
The positive effects of the invention are as follows:
Based on a browser test component and combined with a parallelized design, the invention acquires instant chat messages and non-instant information from QQ groups. It avoids manual login operations and a considerable share of other manual operations.
Traditional collection based on a browser test component has to open the browser UI, which consumes considerable system resources and slows page loading, lowering collection efficiency. The large resource overhead also limits the scale of parallel collection. Acquiring social network data by driving a headless browser through the browser test component reduces system resource consumption, increases the number of parallel collection points available on given hardware, cuts unnecessary time costs, and increases the speed and scale of data collection. A monitoring mechanism is then set up to watch for changes in DOM elements and to locate and parse the data. Combined with parallel and distributed techniques and message-bus task scheduling, this enables large-scale data acquisition across different social platforms.
User behavior, including clicking, typing, and dragging, is simulated through the browser test component, and operations such as logging in to web QQ are performed this way. Compared with cracking the QQ client, this avoids long cracking cycles and high cracking costs, is easy to implement, and requires neither reverse engineering nor extensive debugging. Compared with simulating the communication process (for example AJAX-based requests), it avoids complex analysis of the underlying requests and reduces packet-capture work. This matters because a single data request in the communication process may involve several interdependent requests, and some request parameters are generated randomly by the server with an algorithm that is hard to determine, so simulated requests risk being detected as abnormal and frozen by the server. The approach also avoids large-scale changes to system code after a protocol version update: only the user-operation logic needs to change to adapt to a new version, so the system can be put back into real data acquisition quickly.
The system does not depend on the underlying communication protocol of WebQQ, so it can run across versions.
Because the invention does not depend on lower-level protocols, it can be upgraded quickly when the version changes, adjusting the acquisition method to suit the new version.
The invention is tailored to the characteristics of WebQQ and obtains QQ-related information quickly. It also supports cross-platform data acquisition and adapts to different system environments.
Multiple browsers can be started through different interfaces of the browser test component. The client starts one thread per browser and sends commands to each separately. Because the browsers are independent of one another, data acquisition in one browser does not interfere with another. Simulating user behavior with a browser test component involves some common web-page operations and some common data acquisition functions. Encapsulating these makes them easy to reuse when acquiring data from social networks and reduces engineering time. The system can currently acquire information from multiple groups concurrently.
The method is "what you see is what you get": anything visible on the web page can be acquired by the system. It depends only on the structure of the page elements, so as long as that structure stays the same, no large-scale changes are needed later and maintenance is easy.
In summary, the invention acquires data quickly while preserving its completeness, costs less to maintain, and can be quickly modified and put back into use when the version changes.
Brief description of the drawings
Fig. 1 is a flow chart of QQ group data collection in an embodiment of the invention.
Fig. 2 is a framework diagram of the QQ group data acquisition system based on a browser test component in an embodiment of the invention.
Fig. 3 is a schematic diagram of the web SmartQQ page in an embodiment of the invention.
Fig. 4 is a schematic diagram of logging in to the web SmartQQ page in an embodiment of the invention.
Fig. 5 is a schematic diagram of the group interface in an embodiment of the invention.
Fig. 6 is a schematic diagram of the group messages of one group in an embodiment of the invention.
Fig. 7 is a schematic diagram of the console output showing the acquired information from Fig. 6 in an embodiment of the invention.
Fig. 8 is a schematic diagram of the group list in the web page in an embodiment of the invention.
Fig. 9 is a schematic diagram of the group member list in an embodiment of the invention.
Fig. 10 is a schematic diagram of the group shared files in an embodiment of the invention.
Fig. 11 is a schematic diagram of the group album in an embodiment of the invention.
Fig. 12 is a design diagram of parallelized QQ group data collection in an embodiment of the invention.
Detailed description
The QQ group data collection flow of the invention is shown in Fig. 1 and the system framework in Fig. 2. Acquiring QQ group data comprises instant information acquisition and non-instant information acquisition, as follows:
1) Instant information acquisition based on the browser test component, which can be divided into the following steps:
a) Start a browser through the interface of the browser test component and navigate to the web QQ page. As shown in Fig. 3, SmartQQ (a version of web QQ) is opened.
b) Simulate user behavior: enter the collection account and password, and determine whether a captcha must be entered manually. If so, prompt the operator to enter it (the captchas used by web QQ are processed, so it is currently hard to obtain the correct code through image comparison). When this is done, click login. As shown in Fig. 4, the collection account and password are entered automatically to log in.
c) After the login redirect succeeds, wait for the page elements to finish loading, then simulate a user click to enter the group interface. As shown in Fig. 5, after the redirect the corresponding tabs ("Contacts" -> "Groups") are clicked automatically to enter the group page, where the group list and other data can be obtained.
d) Allocate a time slice to each group through a task scheduling strategy, poll the groups according to the strategy, and monitor and obtain new messages. As shown by the arrow in Fig. 6, a user in one of the groups sends the chat message "233333", and the console prints the captured information, including the speaker, the message content, and the group it belongs to, as shown by the arrow in Fig. 7.
e) Repeat step d) until all groups have been polled.
f) Note that the task scheduling strategy in step d) can be chosen freely. It is generally based on time-slice polling. To reduce data acquisition latency, time slices can be allocated by a chosen strategy or in proportion to the volume of messages each group produces.
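One possible reading of the proportional allocation mentioned in step f) is sketched below. The source leaves the strategy open, so the function `allocate_slices`, the guaranteed minimum slice, and the use of recent message counts as weights are all assumptions.

```python
def allocate_slices(message_counts, total_seconds, min_seconds=1.0):
    """Split one polling cycle among groups in proportion to message volume.

    message_counts: dict mapping group name to recent message count
    total_seconds:  length of one full polling cycle
    min_seconds:    floor so that quiet groups are still visited
    """
    groups = list(message_counts)
    reserved = min_seconds * len(groups)
    if reserved > total_seconds:
        raise ValueError("cycle too short for the minimum slice per group")
    remaining = total_seconds - reserved
    total_messages = sum(message_counts.values())
    slices = {}
    for group in groups:
        if total_messages:
            share = message_counts[group] / total_messages
        else:
            share = 1 / len(groups)  # no traffic anywhere: split evenly
        slices[group] = min_seconds + remaining * share
    return slices
```

For example, with a 12-second cycle, a 1-second minimum, and counts of 90 and 10, the busy group gets about 10 seconds and the quiet group about 2 seconds.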
2) Non-instant information acquisition based on the browser test component, which can be divided into the following steps:
a) Start a browser through a port of the browser test component and navigate to the web QQ group homepage.
b) Simulate user behavior: enter the collection account and password, and determine whether a captcha must be entered manually. If so, prompt the operator to enter it (the captchas used by web QQ are processed, so it is currently hard to obtain the correct code through image comparison). When this is done, click login.
c) After the login redirect succeeds, wait for the page elements to finish loading and obtain from the page the list of groups that the current collection account has joined. Check whether the page contains a label such as "More groups", and load all groups completely by simulating user clicks, as shown in Fig. 8.
d) Poll according to the group list, entering the group member list page first, as shown in Fig. 9. On this page, the group announcement and the group member list can be obtained by locating elements with XPath. Check whether the group announcement has been updated: if so, write it to the database; if not, do not write. Check for changes in the group member list: if a new member has joined, write the member to the database; if a member has left the group, update that member's status in the database.
e) Go to the group shared files page, as shown in Fig. 10, locate the page elements with XPath, obtain the shared file information, and check for updates: if there are any, write them to the database; otherwise skip.
f) Go to the group album page, as shown in Fig. 11, locate the page elements with XPath, obtain the group album information, and check for updates: if there are any, write them to the database; otherwise skip.
g) If the group list has not been fully polled, return to step d). If it has, close the browser, switch to another collection account, and return to step a).
In addition, QQ group data collection in the invention uses a parallelized design, as shown in Fig. 12. The browser test component provides a component called the Hub. The Hub accepts requests from client code and can start browser processes that are isolated from one another. Subsequent client requests are then dispatched by the Hub according to the sessionID configured when each independent process was started, so that each request interacts with the DOM tree of its own browser page. This achieves parallelized data acquisition. The message bus is implemented mainly by a client program written as needed. It obtains and assigns collection accounts, distributing them to different threads for data collection.
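The dispatch-by-sessionID idea can be sketched with a toy in-memory hub. This models only the routing described above; the class `ToyHub` and its methods are hypothetical and do not reproduce the interface of any real browser automation hub.

```python
import itertools


class ToyHub:
    """Hypothetical hub that keeps one isolated command queue per session."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._sessions = {}  # session ID -> commands received by that browser

    def start_session(self):
        """Simulate starting an isolated browser process; return its session ID."""
        session_id = f"session-{next(self._counter)}"
        self._sessions[session_id] = []
        return session_id

    def dispatch(self, session_id, command):
        """Route a client command to the browser identified by session_id."""
        if session_id not in self._sessions:
            raise KeyError(f"unknown session: {session_id}")
        self._sessions[session_id].append(command)

    def history(self, session_id):
        """Return a copy of the commands delivered to one session."""
        return list(self._sessions[session_id])
```

Each client thread would hold its own session ID, so commands from different collection accounts never reach the wrong browser page.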
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510363954.3A CN105095070B (en) | 2015-04-03 | 2015-06-26 | QQ group's data capture method and system based on browser testing component |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2015101562301 | 2015-04-03 | ||
CN201510156230 | 2015-04-03 | ||
CN201510363954.3A CN105095070B (en) | 2015-04-03 | 2015-06-26 | QQ group's data capture method and system based on browser testing component |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105095070A true CN105095070A (en) | 2015-11-25 |
CN105095070B CN105095070B (en) | 2017-12-19 |
Family
ID=54575565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510363954.3A Active CN105095070B (en) | 2015-04-03 | 2015-06-26 | QQ group's data capture method and system based on browser testing component |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105095070B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7200804B1 (en) * | 1998-12-08 | 2007-04-03 | Yodlee.Com, Inc. | Method and apparatus for providing automation to an internet navigation application |
CN101221572A (en) * | 2008-01-25 | 2008-07-16 | 吴坤达 | Web page data processing system |
CN102014078A (en) * | 2010-09-28 | 2011-04-13 | 苏州阔地网络科技有限公司 | Method for realizing instant messaging based on flash on webpage |
CN102033803A (en) * | 2009-09-29 | 2011-04-27 | 国际商业机器公司 | Method and device for testing web application across browsers |
CN102316049A (en) * | 2010-07-02 | 2012-01-11 | 苏州阔地网络科技有限公司 | Method for automatically receiving group message |
CN103067214A (en) * | 2011-10-19 | 2013-04-24 | 阿里巴巴集团控股有限公司 | Method, client, server and system used for testing web site performance |
CN104111852A (en) * | 2014-07-18 | 2014-10-22 | 南京富士通南大软件技术有限公司 | Web application automated testing system and testing method based on data drive |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107257314A (en) * | 2017-06-05 | 2017-10-17 | 成都知道创宇信息技术有限公司 | A kind of message statistics analysis method based on wechat group |
CN108287906A (en) * | 2018-01-28 | 2018-07-17 | 江苏快页信息技术有限公司 | A kind of public sentiment monitoring method based on instant messaging social software |
CN109446392A (en) * | 2018-09-03 | 2019-03-08 | 中新网络信息安全股份有限公司 | A kind of webpage capture system and grasping means based on no interface browser and configurable agent intercepts |
CN115174152A (en) * | 2022-06-08 | 2022-10-11 | 中国科学院信息工程研究所 | Group test authentication encryption method, verification decryption method and communication method |
CN115174152B (en) * | 2022-06-08 | 2024-06-18 | 中国科学院信息工程研究所 | A group test authentication encryption method, verification and decryption method and communication method |
Also Published As
Publication number | Publication date |
---|---|
CN105095070B (en) | 2017-12-19 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107895009B (en) | Distributed internet data acquisition method and system | |
US10515005B1 (en) | Systems and methods for testing source code | |
CN102306120B (en) | Measure actual final user's performance and the availability of WEB application program | |
US10296563B2 (en) | Automated testing of perceptible web page elements | |
WO2018120721A1 (en) | Method and system for testing user interface, electronic device, and computer readable storage medium | |
US8898643B2 (en) | Application trace replay and simulation systems and methods | |
US20210081308A1 (en) | Generating automated tests based on user interaction with an application | |
WO2018126964A1 (en) | Task execution method and apparatus and server | |
US10893091B2 (en) | Management of asynchronous content post and media file transmissions | |
CN107766509B (en) | Method and device for static backup of webpage | |
CN106815141A (en) | A kind of method for testing software and device | |
JP2013508805A (en) | Data update for website users based on preset conditions | |
CN109840298B (en) | Multi-information source collection method and system for large-scale network data | |
CN105095070B (en) | QQ group's data capture method and system based on browser testing component | |
Wong et al. | Design of a crawler for online social networks analysis | |
CN106815142A (en) | A kind of method for testing software and system | |
CN105162676A (en) | Method and system for acquiring WeChat data | |
Sivakumar et al. | Nutshell: Scalable whittled proxy execution for low-latency web over cellular networks | |
CN115422063A (en) | Low-code interface automation system, electronic equipment and storage medium | |
CN113934913A (en) | Data capture method and device, storage medium and electronic equipment | |
US10198537B2 (en) | Method and system for implementing intelligent system diagrams | |
US12204938B2 (en) | Pipeline-based machine learning method and apparatus, electronic device, and computer readable storage medium | |
CN112559525A (en) | Data checking system, method, device and server | |
Tang et al. | Application centric lifecycle framework in cloud | |
US20240004734A1 (en) | Event processing systems and methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||