ES2719123B2

ES2719123B2 - PROCEDURE AND SYSTEM FOR THE CLASSIFICATION AND DETECTION OF THE MOST INFLUENTIAL DOMAINS WITHIN THE DARK TOR NETWORK

Info

Publication number: ES2719123B2
Application number: ES201831145A
Authority: ES
Inventors: Wesam Al-Nabki Mhd; Fernández Eduardo Fidalgo; Gutiérrez Enrique Alegre; Robles Laura Fernández
Original assignee: Universidad de Leon
Current assignee: Universidad de Leon
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2020-09-10
Anticipated expiration: 2038-11-26
Also published as: ES2719123A1

Description

DESCRIPTION

PROCEDURE AND SYSTEM FOR THE CLASSIFICATION AND DETECTION OF THE MOST INFLUENTIAL DOMAINS WITHIN THE DARK TOR NETWORK

OBJECT OF THE INVENTION

The object of the present invention is an automated procedure and system for classifying and detecting the most influential domains within the Tor (The Onion Router) darknet based on the hyperlinks they contain. The invention makes it possible to identify which domain is the most influential within the network, that is, the domain whose removal would cause the greatest destabilization of the network. Such destabilization would result in a very large reduction in, or an obstruction of, the transmission of information across the different domains. The invention also makes it possible to identify the categories of the analyzed domains and the most influential domain in each category.

BACKGROUND OF THE INVENTION

Currently, information on the Internet is accessed through search engines such as Google or Bing. Although they are efficient and very powerful, they cannot index all the content on the Web. The content that is indexed belongs to the Surface Web, and the rest, which is not indexed, belongs to the Deep Web. Part of the Deep Web is made up of several networks and is known as the Darknet; Tor (The Onion Router) is the best-known darknet because of the level of anonymity it provides to its users.

Because of the topology of Tor, and because the traffic of its domains cannot be recorded, it is not possible to measure the popularity or influence of those domains within the network. Furthermore, given the anonymity it provides, the Tor darknet hosts domains with varied content, both legal and illegal. The illegal content on Tor creates the need to classify that content into different types of illegal activity and to identify which domains are the most influential within the network, both globally and within each category. Removing the most influential domains would destabilize the network and hinder the transmission of information across the different domains.

Manual classification of domains has several drawbacks. First, because of the large number of possible domains, on the order of tens of thousands, classification requires a large investment of time and personnel. In addition, the classification depends heavily on the person who performs it, which introduces subjectivity and leads to disagreement among experts. Manual classification is also not always reliable, since errors frequently arise from the expert's fatigue and lapses of attention. Finally, it is an expensive process because of the high cost of the time of the person performing the classification.

Given these characteristics of the Tor darknet, only the textual content of its domains could be used to measure their influence, since certain visual content, such as child pornography, would be illegal under Spanish law. With this limitation in mind, the relevance of the domains is analyzed using graph theory, where the domains are the nodes and the hyperlinks between them are the links. According to several studies (K. Taha and P. Yoo, “Using the Spanning Tree of a Criminal Network for Identifying Its Leaders”, IEEE Transactions on Information Forensics and Security, vol. 12, no. 2, pp. 445-453, 2017), destabilizing a network requires removing the most influential nodes of the graph, so as to achieve a significant reduction in the flow of information through the graph.

The influence or relevance of nodes within a graph has been measured for more than 60 years. Work began with centrality measures (Bavelas, A. (1948). «A mathematical model for group structures». Human Organization 7: 16-30.), which over time evolved into more complex algorithms, such as Katz (Katz, L. (1953). A New Status Index Derived from Sociometric Analysis. Psychometrika, 39-43.), HITS (Jon M. Kleinberg. 1999. Hubs, authorities, and communities. ACM Comput. Surv. 31, 4es, Article 5, December 1999) or PageRank (US6285999B1, 1997).

Although the above influence measurement algorithms achieve acceptable results when destabilizing a network, no method or system has been specifically described for the classification and subsequent detection of the most influential domains in the Tor darknet.

DESCRIPTION OF THE INVENTION

The object of the present invention is an automated procedure and system for classifying and detecting the most influential domains within the Tor (The Onion Router) darknet based on the hyperlinks that exist between them. The hyperlinks are, therefore, specific to the domains.

The procedure and system of the present invention for the classification and detection of the most influential domains within the Tor (The Onion Router) darknet make it possible to automatically classify and rank large repositories of domains obtained by means of digital technology (a computer) with Internet access that is connected to the Tor network in order to collect textual information.

Compared with manual classification by an expert, automatic classification, used as an intermediate step before detecting influential domains, eliminates subjectivity, errors due to fatigue and lapses of attention, disagreement among experts, and the costs associated with the expert's time; it also reduces the time needed for classification and increases the reliability of the labeling. For this reason, the procedure can be implemented in tools used by companies and by the FFCCSSEE (Fuerzas y Cuerpos de Seguridad del Estado, the Spanish State Security Forces and Corps) to classify the domains of the Tor darknet into different categories and then detect the most influential domains, both within the Tor network as a whole and within each selected illegal category.

The present invention can also be applied to the training or distance learning of personnel specialized in the different categories considered illegal. The availability of large sets of already classified domains, together with the current possibilities for collecting new domains and sending them to a system remotely, would allow FFCCSSEE or company personnel to improve their knowledge of the different illegal categories present in the Tor darknet and therefore to distinguish them better from the legal categories present in that network.

In the process that precedes the procedure of the invention, the system allows domains to be classified automatically by encoding their text with Term Frequency - Inverse Document Frequency (TF-IDF), a metric that indicates the relevance of a word in a document, and then classifying it with Logistic Regression (LR). The classified domains fall into the following categories: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) counterfeiting of currency, (viii) smuggling of personal identification and (ix) others.
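As an illustration of the TF-IDF encoding mentioned above, the following is a minimal sketch in plain Python. The function name `tf_idf`, the sample documents and the exact weighting variant (term count divided by document length, multiplied by the natural logarithm of N/df) are assumptions made for this example; the description does not state which variant or library is used, and the Logistic Regression step is not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = term count / document length and
    idf = ln(N / df). Libraries often use smoothed variants instead.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical preprocessed texts, one per domain.
docs = [
    "buy bitcoin wallet bitcoin",
    "buy cloned credit card",
    "buy bitcoin mixer",
]
vectors = tf_idf(docs)
```

In this toy data the word "buy" appears in every document, so its weight is zero, while a word confined to one document, such as "wallet", receives a positive weight. The resulting vectors are what a classifier such as Logistic Regression would take as input.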

Next, the procedure of the invention makes it possible to measure the influence of illegal domains both (i) globally and (ii) within each of the above categories. To this end, an Interesting Activities Graph is constructed, in which each domain is represented by a node and the links of the graph come from the incoming and outgoing hyperlinks contained in those domains. The links are therefore specific to the nodes: a node has as many incoming links as there are hyperlinks pointing to it, and as many outgoing links as it has hyperlinks pointing to other domains. In this process, duplicate links, that is, those with the same origin and destination, are removed to avoid creating multigraphs, as are links to the Surface Web. The algorithm is then applied to identify the most influential domains within the entire network, and also within each category, whose removal would affect the flow of information within the network.
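The graph construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_graph`, the representation of the graph as a dictionary of sets, the use of the `.onion` suffix to recognize Tor domains, and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse


def build_graph(hyperlinks):
    """Build a directed graph as {source_domain: set(target_domains)}.

    `hyperlinks` is an iterable of (source_url, target_url) pairs.
    Self-links are not treated specially here; the text does not say
    how they are handled.
    """
    graph = {}
    for source_url, target_url in hyperlinks:
        source = urlparse(source_url).hostname
        target = urlparse(target_url).hostname
        # Skip malformed URLs.
        if not source or not target:
            continue
        # Skip links that leave Tor (assumed test: host ends in ".onion").
        if not source.endswith(".onion") or not target.endswith(".onion"):
            continue
        # A set keeps one edge per (origin, destination) pair, so no multigraph.
        graph.setdefault(source, set()).add(target)
        # Register the target as a node even if it has no outgoing links.
        graph.setdefault(target, set())
    return graph


# Hypothetical crawl output.
links = [
    ("http://aaa.onion/index.html", "http://bbb.onion/"),
    ("http://aaa.onion/page2", "http://bbb.onion/shop"),   # duplicate edge
    ("http://aaa.onion/", "https://example.com/"),         # Surface Web link
    ("http://bbb.onion/", "http://ccc.onion/"),
]
graph = build_graph(links)
```

Here the two links from aaa.onion to bbb.onion collapse into a single edge, and the link to example.com is discarded.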

In a preferred embodiment of the invention, this procedure is applied to illegal domains of the Tor network, both globally and by category, although it can be applied without categorizing the illegal activities. It can also be extended to legal domains of the same network, to the Surface Web, to contact networks and, in general, to any type of network in which there are incoming and outgoing links between its elements.

In the present description, the text is preprocessed before being encoded for classification. After all Tor domains have been crawled, the resources of those that are active are downloaded and their HTML file is extracted. The domains that are in English are then selected, and HTML tags, special characters and stop words are removed. In the present invention, the term “raw text” refers to the text contained in the HTML of the domain, to which no preprocessing has been applied. The term “text”, in turn, is used generally to refer to the text that results from preprocessing the “raw text” as described above.
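The preprocessing described above could look roughly like the following sketch, which uses only the Python standard library. The stop-word list, the decision to drop the contents of script and style elements, lowercasing, and the names `preprocess` and `STOP_WORDS` are assumptions for illustration. Language detection is not shown, since it relies on a third-party library that the description does not name.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(raw_html):
    """Turn raw HTML into lowercase text without tags, symbols or stop words."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    # Replace every character that is not a letter, digit or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


raw = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>The Market!</h1><p>Buy &amp; sell cards in the shop.</p></body></html>"
)
clean = preprocess(raw)
```

For this sample page, `clean` is the string "market buy sell cards shop".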

The preferred procedure of the present invention for the classification and subsequent detection of the most influential illegal domains within the Tor darknet comprises the following steps:

1. Domain crawling and raw text download. Starting from a public list of domains on the Tor network, the domains are crawled and, for each domain that is active at the time of the crawl, its HTML file, which contains the raw text, is downloaded. The crawling and downloading are performed by a computer connected to the Internet and to the Tor darknet.

2. Raw text preprocessing: on the same computer, the raw text contained in each HTML file obtained is preprocessed to obtain the text. Next, in accordance with a preferred embodiment of the invention, a language detection library is used to select the domains that are in English, in order to improve the subsequent training of the classification system, since English is the language most widely used on the Tor network. In a preferred embodiment, HTML tags, special characters and stop words are removed, yielding the final text.

3. Text classification: in accordance with a preferred embodiment of the invention, the text is classified automatically so that the most influential domains of the Tor darknet can be identified within each of the categories of illegal activity it contains. In a preferred embodiment, the text is encoded with Term Frequency - Inverse Document Frequency (TF-IDF) and the domains are classified with Logistic Regression (LR) into the following categories: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) counterfeiting of currency, (viii) smuggling of personal identification and (ix) others. In accordance with a preferred embodiment of the invention, a minimum vector length of three and a maximum of 10,000 elements is used for the Term Frequency - Inverse Document Frequency encoding, and class weight balancing is enabled for the Logistic Regression classification.

4. Construction of the Interesting Activities Graph: once the text has been prepared, the Interesting Activities Graph is constructed for all the domains of the Tor network. In a preferred embodiment of the invention, the Interesting Activities Graphs corresponding to the domains classified in the nine categories listed above are also constructed. In this preferred embodiment, each domain is associated with a node of the graph, and the links between nodes are established from the incoming and outgoing hyperlinks of each domain. In the preferred embodiment, duplicate hyperlinks, that is, those with the same origin and destination, are removed.

5. Calculation of the most influential domains with the influence algorithm. Finally, the most influential domains of the Tor darknet are calculated. In accordance with a preferred embodiment of the invention, the most influential domains are also calculated for each of the illegal categories of the Tor darknet for which the Interesting Activities Graphs were generated. The calculation is carried out in two phases. In the first phase, the influence algorithm, that is, the algorithm that measures the influence of domains, is applied to calculate the ranking of the domains extracted from the Tor network. The rank value of a domain is obtained as the weighted combination of the sum of the number of hyperlinks of the follower domains and of the followed domains of the domain being analyzed. According to several studies, a network is destabilized by removing the highest-ranked nodes, which obstructs the flow of information through the graph. In accordance with a preferred embodiment of the invention, influence within the Tor darknet is interpreted as the amount of obstruction that a node can cause to the Interesting Activities Graph when it is removed. In the second phase, the domains are sorted in descending order of the rank value obtained, so that the first domains are those with the highest value and are therefore considered the most influential.
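The two phases of step 5 can be sketched as follows. The description does not give the exact formula or the weights, so this example adopts one simple reading as an assumption: the rank value of a domain is a weighted sum of the number of distinct hyperlinks it receives from follower domains and the number it sends to followed domains. The function name `rank_domains`, both weight parameters with their default value of 0.5, and the sample graph are hypothetical.

```python
def rank_domains(graph, follower_weight=0.5, followed_weight=0.5):
    """Return (domain, score) pairs sorted by descending score.

    `graph` maps each domain to the set of domains it links to.
    Assumed score: follower_weight * in-links + followed_weight * out-links.
    """
    # Phase one: count incoming links (followers) for every node.
    in_degree = {node: 0 for node in graph}
    for source, targets in graph.items():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1
    scores = {
        node: follower_weight * in_degree[node]
        + followed_weight * len(graph.get(node, ()))
        for node in in_degree
    }
    # Phase two: descending order of score, ties broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical graph: one hub linked to and from several smaller domains.
graph = {
    "hub.onion": {"a.onion", "b.onion", "c.onion"},
    "a.onion": {"hub.onion"},
    "b.onion": {"hub.onion"},
    "c.onion": set(),
}
ranking = rank_domains(graph)
```

With these inputs, hub.onion ranks first with a score of 2.5, followed by a.onion and b.onion (1.0 each) and c.onion (0.5). Under the interpretation given in the text, the top-ranked domain is the one whose removal would be expected to obstruct the graph the most.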

A second aspect of the present invention relates to a system for the classification and subsequent detection of the most influential illegal domains within the Tor darknet, starting from domains retrieved from the Tor darknet. The system comprises data processing means, such as a computer with an Internet connection, configured to: crawl and download the raw text or HTML of domains on the Tor network; preprocess the raw text to obtain text ready for analysis; optionally classify the domains automatically by means of Term Frequency - Inverse Document Frequency and Logistic Regression; generate an Interesting Activities Graph, in which the nodes are the domains and the links are the hyperlinks between the domains; and apply the influence algorithm to obtain the most influential domains within the Tor network, namely those that obtain the highest value from the algorithm.

In a preferred embodiment of the invention, the system comprises a computer connected to the Internet and configured with access to the Tor darknet. The system may also comprise data storage means that store the HTML or raw text files, the files containing the preprocessed text, the categories of the domains, the Interesting Activities Graphs and the list of Tor network domains sorted by rank value.

Finally, the present invention also relates to a program product comprising program instruction means for carrying out the procedure described above when the program is executed on a processor. The program product is preferably stored on a program carrier medium. The program instruction means may take the form of source code, object code, or code intermediate between source and object code, for example in partially compiled form, or any other form suitable for use in implementing the processes according to the invention.

The program carrier medium may be any entity or device capable of carrying the program. For example, the carrier could include a storage medium such as a ROM, a CD-ROM or a semiconductor ROM, a flash memory, a magnetic recording medium, for example a hard disk, or a solid-state drive (SSD). Furthermore, the program instruction means stored on the program carrier may be conveyed, for example, by an electrical or optical signal that can be carried via electrical or optical cable, by radio, or by any other means.

When the program product is embodied in a signal that can be conveyed directly by a cable or other device or means, the program carrier may be constituted by that cable or other device or means.

Alternatively, the program carrier may be an integrated circuit in which the program product is embedded, the integrated circuit being adapted to execute, or to be used in the execution of, the corresponding processes.

BRIEF DESCRIPTION OF THE DRAWINGS

A series of figures is briefly described below. They aid understanding of the invention and relate expressly to an embodiment of the invention that is presented as a non-limiting example.

Fig. 1 shows a simplified diagram of a system capable of carrying out the procedure of the invention.

Fig. 2 shows an example of the HTML content or raw text of a Tor darknet domain.

Fig. 3 shows an example of the text that results from preprocessing the Tor darknet domain presented in Fig. 2.

Fig. 4 shows the Graph of Activities of Interest for all the illegal domains of the Tor network.

Fig. 5 shows the Graph of Activities of Interest for the Tor network domains belonging to some of the previously mentioned categories, among them “credit card smuggling” and “sale of drugs”.

Fig. 6 shows the output data file containing the Tor darknet domains ordered by rank value, which indicates their influence within the darknet.

PREFERRED EMBODIMENT OF THE INVENTION

An example of a procedure according to the invention is described below with reference to the attached figures. Figure 1 shows a simplified diagram of the system for crawling, classifying and detecting the most influential domains. All of it could be implemented on a computer 2 (which could be any desktop or laptop machine with one core, 512 MB of RAM and 8 GB of hard disk). The computer 2 is connected to the internet and configured to access the Tor darknet 1. Next, domains are crawled 3 and the raw text 3 of those that are active is downloaded, yielding an HTML text file 4. The raw text in this file is preprocessed 5 to obtain the final text 6 on which the rest of the procedure operates. In this example of the procedure according to the invention, an automatic classification of the text 6a is also carried out, producing a series of labels that correspond to the categories of the analyzed domains 6b. From the preprocessed text 6, the incoming and outgoing hyperlinks of each domain are extracted and a Graph of Activities of Interest 8 is built for the entire Tor darknet. Additionally, in this example, a Graph of Activities of Interest is built for each of the categories resulting from the automatic classification. Finally, the influence algorithm 9 is applied, producing a data file 10 in which the Tor domains appear ordered by rank value. The domains in the first positions are considered the most influential in the network. Additionally, in this example, one data file is generated for each category resulting from the automatic classification, in which the domains belonging to that category appear ordered by rank value.
Each step of the procedure of the invention is described below.

The computer 2 can be connected to the internet through a wireless connection or through an Ethernet network cable. Connecting the computer 2 to the Tor darknet 1 involves installing dedicated software that allows connection to that darknet, for example the Tor Browser. The purpose of this connection and configuration is to obtain the raw text needed for the automatic classification and the subsequent calculation of the most influential domains.

Next, the domains are crawled and the raw text is downloaded. First, a public list of Tor darknet domains is obtained, for example from the Surface Web. Given the short life cycle of Tor network domains, no link is provided in this document. The crawling program reads this list of domains and downloads the raw text 3 of the domains that are active, producing one HTML file per active domain analyzed.
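
The crawling step could be sketched as follows. This is a minimal illustration, not the program used by the inventors: the function name `crawl` and the injected `fetch` callable are hypothetical, and `fetch` stands in for whatever HTTP client is routed through the local Tor connection (the description does not name one). A domain counts as active here simply when `fetch` returns non-empty content.

```python
from typing import Callable, Dict, Iterable, Optional


def crawl(domains: Iterable[str],
          fetch: Callable[[str], Optional[str]]) -> Dict[str, str]:
    """Return {domain: raw_html} for every domain that responds.

    `fetch` is a hypothetical callable that downloads one domain through
    Tor and returns its HTML, or None / raises OSError when it is down.
    """
    pages: Dict[str, str] = {}
    for domain in domains:
        domain = domain.strip()
        if not domain or domain in pages:
            continue  # skip blank lines and repeated entries in the list
        try:
            html = fetch(domain)
        except OSError:
            html = None  # unreachable domains are treated as inactive
        if html:
            pages[domain] = html  # one HTML document per active domain
    return pages
```

Passing the download function in as a parameter keeps the sketch independent of any particular proxy setup and makes it easy to exercise with a stub.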

Figure 2 shows an example of the HTML content or raw text of a Tor darknet domain. As can be seen, it contains many textual elements that are not natural language, such as HTML tags, which must be removed before continuing with the procedure in order to achieve higher classification accuracy.

In the next stage, the raw text contained in the HTML files retrieved from the Tor darknet is preprocessed. First, the HTML tags are removed; for tags that reference images, the extension is removed and the image name is kept. Next, the domains written in English are selected, since English is the dominant language of the Tor network, although the procedure could be applied to other languages. In this preferred embodiment of the invention, the selection is made with the Langdetect library (https://pypi.python.org/pypi/langdetect). Then, special characters and stopwords are removed using the SMART stopword list (http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/). Given the working domain, that is, the Tor darknet, this list is modified and 100 new words are added to improve compatibility. Finally, all emails, web addresses and currencies are unified into a single textual resource. Figure 3 shows an example of the text that results from preprocessing the Tor darknet domain presented in Figure 2.
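
The preprocessing steps above can be illustrated with a short standard-library sketch. Several details are assumptions made only for illustration: the placeholder tokens `EMAIL`, `URL` and `MONEY`, the regular expressions, and the tiny `STOPWORDS` set (the description uses the SMART list extended with 100 words, which is not reproduced here). Language selection with Langdetect is omitted to keep the example free of third-party dependencies.

```python
import re
from html.parser import HTMLParser
from pathlib import PurePosixPath

# Tiny illustrative stop-word set (assumption); not the SMART list.
STOPWORDS = {"a", "an", "and", "at", "for", "or", "the", "to", "us"}

# Illustrative patterns (assumptions), one per kind of unified resource.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)
MONEY_RE = re.compile(
    r"[$€£]\s?\d+(?:[.,]\d+)*|\b\d+(?:[.,]\d+)*\s?(?:usd|eur|btc)\b",
    re.IGNORECASE,
)


class _TextExtractor(HTMLParser):
    """Collects text and, for <img> tags, the image name without extension."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            name = PurePosixPath(src).stem  # "/img/visa_card.png" -> "visa_card"
            if name:
                self.parts.append(name)

    def handle_data(self, data):
        self.parts.append(data)


def preprocess(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Unify emails, web addresses and currency amounts into placeholder tokens.
    text = EMAIL_RE.sub(" EMAIL ", text)
    text = URL_RE.sub(" URL ", text)
    text = MONEY_RE.sub(" MONEY ", text)
    # Keep only word-like tokens (drops special characters), then stop words.
    tokens = re.findall(r"[A-Za-z0-9_]+", text)
    return " ".join(t for t in tokens if t.lower() not in STOPWORDS)
```

Emails are replaced before web addresses so that an address such as `seller@mail.com` is not partially consumed by the URL pattern.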

After the raw text has been preprocessed, the domains are classified automatically, so that the most relevant domains can be computed within each category and not only across the whole Tor darknet. The processed text is preferably encoded with TF-IDF (Akiko Aizawa. 2003. An information-theoretic perspective of tf-idf measures. Information Processing & Management, 39(1):45-65), using feature vectors with a minimum length of three and a maximum length of 10000 elements. Next, the system is trained with Logistic Regression, LR (David W. Hosmer Jr. and Stanley Lemeshow. 2004. Applied logistic regression. John Wiley & Sons), with class weight balancing activated. The resulting categories are: (i) pornography, (ii) cryptocurrency, (iii) credit card smuggling, (iv) sale of illegal drugs, (v) violent activities, (vi) cyber attacks (hacking), (vii) currency counterfeiting, (viii) personal identification smuggling and (ix) others.
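
To make the encoding step concrete, the following pure-Python sketch computes TF-IDF vectors capped at a maximum number of features, together with per-class balancing weights. It is only an illustration under stated assumptions: the description does not say which TF-IDF variant is used (here, raw term count times ln(N/df) + 1), how the vocabulary is capped (here, by document frequency), or how class balancing is computed (here, the common heuristic n_samples / (n_classes × class_count)). The function names are hypothetical, and the Logistic Regression training itself is not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str],
                  max_features: int = 10000) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, vectors) using tf * (ln(N / df) + 1) weighting."""
    tokenised = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Keep the terms with the highest document frequency (ties alphabetical).
    vocab = sorted(sorted(df), key=lambda t: -df[t])[:max_features]
    n_docs = len(docs)
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}
```

With these weights, a class that is three times rarer than another contributes three times more per sample to the training loss, which is one way to read "balance of weights between classes".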

Once the raw text has been preprocessed and the domains classified, the Graphs of Activities of Interest are built. First, the incoming and outgoing hyperlinks belonging to the HTTP and HTTPS protocols are extracted for each domain. During this process, duplicate hyperlinks (and therefore duplicate graph links), that is, those with the same origin and destination, are removed to avoid creating multigraphs, as are hyperlinks that point to the Surface Web. Next, the Graph of Activities of Interest is built, in which the nodes correspond to domains and the links correspond to the incoming and outgoing hyperlinks contained in those domains. A link is generated between two nodes A and B whenever domain A refers to domain B, or vice versa, at least once.
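
A minimal sketch of this graph construction is given below. The names `onion_host` and `build_graph` are hypothetical, the graph is represented simply as a set of directed (source, target) pairs, and treating any host that does not end in `.onion` as Surface Web is an assumption. Dropping links from a domain to itself is also an assumption of this sketch; the description only mentions duplicate links.

```python
from typing import Dict, Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit


def onion_host(url: str) -> Optional[str]:
    """Return the .onion host of an HTTP(S) URL, or None for anything else."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme in ("http", "https") and host.endswith(".onion"):
        return host
    return None


def build_graph(links_by_domain: Dict[str, Iterable[str]]) -> Set[Tuple[str, str]]:
    """Return the set of directed edges (source domain, target domain)."""
    edges: Set[Tuple[str, str]] = set()  # a set discards duplicate links
    for source, urls in links_by_domain.items():
        for url in urls:
            target = onion_host(url)
            if target is None:
                continue  # other protocols and Surface Web links are dropped
            if target == source:
                continue  # assumption: links to the domain itself are ignored
            edges.add((source, target))
    return edges
```

Because the edges are stored in a set, many hyperlinks from one domain to another collapse into a single link, which keeps the result a simple graph rather than a multigraph.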

Figure 4 shows a general view of the Graph of Activities of Interest for all the domains considered in the Tor network.

Figure 5 shows a more detailed view of the Graph of Activities of Interest, in which the nodes representing each domain are labeled by category and the multiple links between them are visible.

Finally, the influence is calculated both for the full list of analyzed Tor darknet domains and for the domains within each of the following categories: (i) pornography 11, (ii) cryptocurrency 12, (iii) credit card smuggling 13, (iv) sale of illegal drugs 14, (v) violent activities 15, (vi) cyber attacks 16 (hacking), (vii) currency counterfeiting 17, (viii) personal identification smuggling 18. The category “others” is not included: it covers activities of many types, and a list of domains is already computed for the whole network, so it is not considered to contribute a relevant list of domains.
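
One possible way to derive the per-category graphs from the global one is sketched below. The description does not state how links between domains of different categories are handled, so this sketch assumes that a link belongs to a category graph only when both of its endpoints carry that category. The function name and the `category_of` mapping are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def category_subgraphs(edges: Set[Edge], category_of: Dict[str, str],
                       skip: str = "others") -> Dict[str, Set[Edge]]:
    """Group edges whose two endpoints share a category, ignoring `skip`."""
    graphs: Dict[str, Set[Edge]] = {}
    for source, target in edges:
        category = category_of.get(source)
        if category is None or category == skip:
            continue  # unclassified domains and the "others" class are left out
        if category_of.get(target) == category:
            graphs.setdefault(category, set()).add((source, target))
    return graphs
```

Each resulting edge set can then be ranked on its own, in the same way as the global graph.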

This measure of influence is based on computing the rank value associated with each domain and then sorting the domains in descending order of rank value, so that the first domains are those with the highest value and are therefore considered the most influential. The influence algorithm identifies the most central node of a graph by measuring the number of nodes to which it can propagate traffic and the number of nodes from which it receives traffic. The rank value is computed in two phases: an initialization phase and a weight update phase. Given a Graph of Activities of Interest containing N nodes and E links, the algorithm is initialized by assigning an initial weight $W_n$ to each node n, using the following formula:

$$W_n = D_i + D_o \tag{1}$$

where $D_i$ is the in-degree and $D_o$ is the out-degree of the node. $D_i$ is the number of incoming links of a node and $D_o$ is the number of its outgoing links.

Next, each node n is assigned the accumulated weight of its followers, which are the nodes that point to node n, and the accumulated weight of the nodes that node n follows. The rank value (R) is assigned to node n taking into account the initial weight $W_n$ assigned in (1), according to the following formula:

$$R_n = W_n + \log(\alpha W_F + \beta W_f) \tag{2}$$

where $W_F$ is the accumulated weight of the followers and $W_f$ is the accumulated weight of the nodes that n follows. The parameters $\alpha$ and $\beta$ control the contribution of the weights of the followers and of the nodes that the node follows, respectively.
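
The two phases can be sketched in a few lines of Python. The operators in formulas (1) and (2) are partly lost in the available text, so this sketch assumes the readings W_n = D_i + D_o and R_n = W_n + log(α·W_F + β·W_f), which match the wording "sum of the number of links" and "weighted sum" used in the claims. The function name `rank_domains` is hypothetical, and the default parameter values are the ones reported later in the description.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_domains(edges: Iterable[Tuple[str, str]],
                 alpha: float = 1.0, beta: float = 0.2) -> List[Tuple[str, float]]:
    """Return (domain, rank) pairs sorted from most to least influential."""
    followers = defaultdict(set)  # node -> nodes that link to it
    followed = defaultdict(set)   # node -> nodes it links to
    for source, target in edges:
        followed[source].add(target)
        followers[target].add(source)
    nodes = set(followers) | set(followed)

    # Initialisation phase: W_n = D_i + D_o (assumed reading of formula 1).
    weight = {n: len(followers[n]) + len(followed[n]) for n in nodes}

    ranks: Dict[str, float] = {}
    for n in nodes:
        w_followers = sum(weight[m] for m in followers[n])  # W_F
        w_followed = sum(weight[m] for m in followed[n])    # W_f
        # Update phase: R_n = W_n + log(alpha*W_F + beta*W_f)
        # (assumed reading of formula 2; requires a positive argument).
        ranks[n] = weight[n] + math.log(alpha * w_followers + beta * w_followed)
    # Descending rank; ties are broken alphabetically for reproducibility.
    return sorted(ranks.items(), key=lambda item: (-item[1], item[0]))
```

For a small star-shaped graph in which three domains link to a hub and the hub links back to one of them, the hub comes first in the returned list, as expected of the most central node.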

The influence algorithm identifies which domain is the most influential within the network, that is, the domain whose removal would cause the greatest destabilization of the network. Such destabilization would result in a very large reduction of, or an obstruction to, the transmission of information across the different domains. To measure how the transmission of information within a graph is affected after a node is removed, a density measure is used, according to the following formula:

$$D_g = \frac{E}{N(N-1)} \tag{3}$$

where E is the number of links and N is the number of nodes of the graph. Therefore, a good ordering of domains by influence is one that yields the lowest possible density after removing the fewest possible nodes, taken from those that obtained a high score or rank value.
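
As a quick check of the density measure, consider a hypothetical directed graph with N = 4 nodes and E = 6 links, and suppose that removing one node also removes the four links in which it takes part, leaving N = 3 and E = 2 (these numbers are illustrative only):

```latex
D_g = \frac{E}{N(N-1)} = \frac{6}{4 \cdot 3} = 0.5
\qquad\longrightarrow\qquad
D_g' = \frac{2}{3 \cdot 2} \approx 0.33
```

The density falls because the removed node carried a large share of the links; removing a peripheral node instead would lower N while leaving most links in place, and the density could even rise.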

Once the rank value has been obtained for all nodes, an evaluation procedure is carried out in which the nodes with the highest rank value are removed from the graph one at a time and the density is recalculated after each removal. This process continues until the graph is completely disconnected, that is, until its density is 0. After experimenting with different values, the parameters $\alpha$ and $\beta$ are set to 1.0 and 0.2 respectively as the final values for the influence algorithm, since these give the lowest area under the curve (AUC, “Area Under the Curve”), which corresponds to the lowest possible density when removing the fewest nodes with the highest rank value.
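
The evaluation procedure can be sketched as follows, using the density E / (N(N − 1)) described above. The function names are hypothetical, and computing the area under the density curve with the trapezoidal rule and unit spacing between removals is an assumption, since the description does not say how the AUC is obtained.

```python
from typing import Iterable, List, Set, Tuple


def density(nodes: Set[str], edges: Set[Tuple[str, str]]) -> float:
    """Directed graph density E / (N * (N - 1)); 0 with fewer than two nodes."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def removal_curve(edges: Iterable[Tuple[str, str]], ranking: List[str]) -> List[float]:
    """Density before any removal and after each removal, in ranking order."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    curve = [density(nodes, edges)]
    for node in ranking:
        if curve[-1] == 0.0:
            break  # the graph is already completely disconnected
        nodes.discard(node)
        edges = {(s, t) for s, t in edges if node not in (s, t)}
        curve.append(density(nodes, edges))
    return curve


def area_under_curve(curve: List[float]) -> float:
    """Trapezoidal area with unit spacing between consecutive removals."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))
```

A ranking that places the true hub first disconnects the graph in fewer removals and therefore yields a smaller area than a ranking that removes peripheral nodes first, which is the criterion used to compare parameter settings.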

Figure 6 shows an example of the output data file containing the Tor darknet domains ordered by the influence algorithm, indicating their influence within the darknet.

Claims

1. Procedure for the classification and detection of the most influential domains within the Tor darknet (1), characterized in that it comprises the following steps:

- crawling a plurality of domains within the Tor darknet (1) and downloading raw text (3) from at least one domain (7), by means of a computer (2) connected to the internet and configured to access the Tor darknet (1);

- obtaining an HTML file (4) from the raw text (3);

- preprocessing the raw text (5) to obtain a preprocessed text (6) and a plurality of incoming and outgoing hyperlinks extracted from the at least one domain (7);

- performing an automatic classification (6a) of the preprocessed text (6) by encoding the text as feature vectors and applying a regression-based machine learning algorithm, and obtaining a series of domain categories (6b);

- constructing a Graph of Activities of Interest (8) from the incoming and outgoing hyperlinks extracted from the at least one domain (7);

- determining the rank value (9) for each domain, obtained as a weighted combination of the sum of the number of links of each node and the links obtained from the hyperlinks to the domains that follow, and are followed by, each domain (7);

- ordering the nodes, and therefore the corresponding domains, according to their rank value;

- labeling the domains with the highest rank value as the most influential within the Tor darknet.

2. Procedure according to claim 1, wherein the preprocessing of the raw text (5) of the HTML files (4) comprises:

- removing the HTML tags;

- for tags that reference images, removing the extension and keeping the image name;

- selecting domains in English with a language detection library;

- removing special characters and stopwords by means of a stopword list;

- modifying the stopword list and adding a plurality of new stopwords to improve compatibility with the Tor darknet;

- unifying all emails, web addresses and currencies into a single textual resource.

3. Procedure according to claim 1, wherein the step of performing an automatic classification of the domains (6a) comprises:

- encoding the preprocessed text by means of the Term Frequency-Inverse Document Frequency (TF-IDF) technique, using a feature vector length of between three and 10000 elements;

- training a machine learning algorithm on a labeled training set using Logistic Regression, with class weight balancing activated;

- classifying, using the trained machine learning algorithm, the domains of the Tor darknet (1) into at least one of the nine defined classes:

• pornography,

• cryptocurrency,

• credit card smuggling,

• sale of illegal drugs,

• violent activities,

• cyber attacks,

• counterfeiting of currency,

• personal identification smuggling, and

• others.

4. Procedure according to claim 1, comprising displaying a label for each text block (6) once it has been classified.

5. Procedure according to claim 1, comprising the construction of a Graph of Activities of Interest (8) for the entire Tor darknet (1) and a Graph of Activities of Interest (8) for each category resulting from the classification (6b), using the incoming and outgoing hyperlinks (7) of each domain, where the construction of the Graph of Activities of Interest comprises:

- extracting, for each domain, the incoming and outgoing hyperlinks belonging to the HTTP and HTTPS protocols;

- eliminating duplicate hyperlinks, that is, those that have the same origin and destination, and hyperlinks that point to the Surface Web;

- building a Graph of Activities of Interest, where the nodes correspond to domains and the links to the different incoming and outgoing hyperlinks contained in those domains, so that a link is generated between two nodes A and B whenever domain A refers to domain B, or vice versa, at least once.

6. Procedure according to claim 1, comprising applying the influence algorithm to all the domains of the global Graph of Activities of Interest (8), identifying as the most relevant domains within the Tor darknet those that obtain the highest rank values, where the calculation of the rank value for any domain comprises:

- initializing the rank value with an initial weight based on the number of incoming and outgoing links of the corresponding node within the Graph of Activities of Interest;

- updating the rank value taking into account the initial weight and the weighted sum of the accumulated weights of its follower and followed domains, where the contribution of the weights of the follower and followed domains is controlled by two parameters α, β.

7. Procedure according to claim 1, comprising applying the influence algorithm to all the domains of the Graph of Activities of Interest (8) built for each category, identifying as the most relevant domains within each category of the Tor darknet those that obtain the highest rank values within the selected category, where the calculation of the rank value for any domain comprises:

- initializing the rank value with an initial weight based on the in-degree and out-degree of the corresponding node, that is, the number of incoming and outgoing links of that node within the Graph of Activities of Interest;

- updating the rank value taking into account the initial weight and the weighted sum of the accumulated weights of its follower and followed domains, where the contribution of the weights of the follower and followed domains is controlled by two parameters α, β.

8. System for the classification and detection of the most influential domains within the Tor dark network characterized by comprising data processing means (2) configured to:

- obtain an HTML file (4) from the raw text (3);

9. System according to claim 8, comprising a computer (2) connected to the internet and configured to access the Tor darknet (1), containing the program for crawling and downloading raw text (3), the program for the automatic classification of text (6a), the construction of the Graph of Activities of Interest (8), the calculation of the rank value for all the analyzed domains, and the indication of the most relevant domains of the Tor darknet (10), at a global level and by category.

10. System according to any of claims 8 to 9, comprising data storage means in which the HTML files (4), the preprocessed text (6), the categories of the analyzed domains (6b) and the data file with the lists of domains ordered by level of influence are stored.

11. A program product comprising program instruction means for carrying out the procedure defined in any one of claims 1 to 7 when the program is executed on a processor.

12. A program product according to claim 11, stored on a program support medium.