Concept Identification from Single-Documents

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

© Springer Nature Switzerland AG 2018. This article presents a method that automatically extracts relevant concepts, each consisting of one or several words. Its main contribution is that it works on a single document of any domain, regardless of its length; the experiments nevertheless use short documents, which are the most common kind found on the web. The research was conducted on documents written in Spanish and was tested on multiple randomly chosen domains so that their results could be compared. To do so, it uses an algorithm that automatically identifies syntactic patterns in the document, building on the previous work of [1]. This algorithm is based on statistical approximations and on the length of the identifiable patterns contained in the document, and it applies heuristics that favor or penalize the choice of patterns depending on which of the five methods (M1 to M5) is selected. These patterns yield the candidate concepts, which go through a further evaluation process that produces the final concepts. The proposal offers at least four advantages: (1) it is multi-domain, (2) it is independent of the text length, (3) it can work with one or more documents, and (4) it allows garbage or undesirable patterns to be discarded from the beginning. The method was applied to 11 different domains, with precision ranging from 58% to 70% and recall from 25% to 46%.
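The abstract reports its results as precision and recall. As a minimal sketch of how these two measures are usually computed for a concept-extraction task, the Python below compares a set of extracted concepts against a reference set. The function name `precision_recall`, the case-insensitive exact-match criterion, and the sample Spanish terms are all assumptions made for illustration; they are not taken from the paper, which may match and score concepts differently.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted concepts against a gold set.

    Assumption: two concepts match only if they are identical after
    trimming whitespace and lowercasing.
    """
    extracted_set = {c.strip().lower() for c in extracted}
    gold_set = {c.strip().lower() for c in gold}
    true_positives = len(extracted_set & gold_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical data, invented for this example only.
extracted = ["red neuronal", "aprendizaje automático", "datos", "página web"]
gold = ["red neuronal", "aprendizaje automático", "minería de datos",
        "modelo", "corpus"]

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four extracted concepts appear in the five-item reference set, so precision is 2/4 and recall is 2/5. Precision being higher than recall, as in the reported ranges, means the extracted concepts are mostly correct but many reference concepts are missed.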
Original language: American English
Title of host publication: Concept identification from single-documents
Pages: 158-173
Number of pages: 16
DOI: https://doi.org/10.1007/978-3-030-00940-3_12
State: Published - 1 Nov 2018
Event: International Conference on Technologies and Innovation: 4th International Conference, CITI 2018 - Guayaquil, Ecuador
Duration: 6 Nov 2018 to 9 Nov 2018
Conference number: 4
https://link.springer.com/book/10.1007/978-3-030-00940-3

Conference

Conference: International Conference on Technologies and Innovation
Abbreviated title: Technologies and Innovation
Country: Ecuador
City: Guayaquil
Period: 6/11/18 to 9/11/18

Keywords

  • Concept extraction
  • Syntactic patterns
  • Text analysis
  • Single-documents

Cite this

@inproceedings{58e3bdd7f8944cf7bce55da9bd76ed03,
title = "Concept Identification from Single-Documents",
abstract = "{\textcopyright} Springer Nature Switzerland AG 2018. This article presents a method that automatically extracts relevant concepts, each consisting of one or several words. Its main contribution is that it works on a single document of any domain, regardless of its length; the experiments nevertheless use short documents, which are the most common kind found on the web. The research was conducted on documents written in Spanish and was tested on multiple randomly chosen domains so that their results could be compared. To do so, it uses an algorithm that automatically identifies syntactic patterns in the document, building on the previous work of [1]. This algorithm is based on statistical approximations and on the length of the identifiable patterns contained in the document, and it applies heuristics that favor or penalize the choice of patterns depending on which of the five methods (M1 to M5) is selected. These patterns yield the candidate concepts, which go through a further evaluation process that produces the final concepts. The proposal offers at least four advantages: (1) it is multi-domain, (2) it is independent of the text length, (3) it can work with one or more documents, and (4) it allows garbage or undesirable patterns to be discarded from the beginning. The method was applied to 11 different domains, with precision ranging from 58{\%} to 70{\%} and recall from 25{\%} to 46{\%}.",
keywords = "Concept extraction, Syntactic patterns, Text analysis, Single-documents",
author = "{Ochoa Hern{\'a}ndez}, {Jos{\'e} Luis} and {Barcel{\'o} Valenzuela}, Mario and {S{\'a}nchez Schmitz}, Gerardo and {Torres Peralta}, Raquel",
year = "2018",
month = "11",
day = "1",
doi = "10.1007/978-3-030-00940-3_12",
language = "American English",
pages = "158--173",
booktitle = "Concept identification from single-documents",

}

Ochoa Hernández, JL, Barceló Valenzuela, M, Sánchez Schmitz, G & Torres Peralta, R 2018, Concept Identification from Single-Documents. in Concept identification from single-documents. pp. 158-173, International Conference on Technologies and Innovation, Guayaquil, Ecuador, 6/11/18. https://doi.org/10.1007/978-3-030-00940-3_12

Concept Identification from Single-Documents. / Ochoa Hernández, José Luis; Barceló Valenzuela, Mario; Sánchez Schmitz, Gerardo; Torres Peralta, Raquel.

Concept identification from single-documents. 2018. p. 158-173.

TY - GEN

T1 - Concept Identification from Single-Documents

AU - Ochoa Hernández, José Luis

AU - Barceló Valenzuela, Mario

AU - Sánchez Schmitz, Gerardo

AU - Torres Peralta, Raquel

PY - 2018/11/1

Y1 - 2018/11/1

AB - © Springer Nature Switzerland AG 2018. This article presents a method that automatically extracts relevant concepts, each consisting of one or several words. Its main contribution is that it works on a single document of any domain, regardless of its length; the experiments nevertheless use short documents, which are the most common kind found on the web. The research was conducted on documents written in Spanish and was tested on multiple randomly chosen domains so that their results could be compared. To do so, it uses an algorithm that automatically identifies syntactic patterns in the document, building on the previous work of [1]. This algorithm is based on statistical approximations and on the length of the identifiable patterns contained in the document, and it applies heuristics that favor or penalize the choice of patterns depending on which of the five methods (M1 to M5) is selected. These patterns yield the candidate concepts, which go through a further evaluation process that produces the final concepts. The proposal offers at least four advantages: (1) it is multi-domain, (2) it is independent of the text length, (3) it can work with one or more documents, and (4) it allows garbage or undesirable patterns to be discarded from the beginning. The method was applied to 11 different domains, with precision ranging from 58% to 70% and recall from 25% to 46%.

KW - Concept extraction

KW - Syntactic patterns

KW - Text analysis

KW - Single-documents

U2 - 10.1007/978-3-030-00940-3_12

DO - 10.1007/978-3-030-00940-3_12

M3 - Conference contribution

SP - 158

EP - 173

BT - Concept identification from single-documents

ER -