2018142125 조정빈
“I, 조정빈, pledge that this assignment is my own work. I am committed to upholding
the highest standards of integrity in all academic endeavors. I understand that any form of
dishonesty, such as plagiarism, will not be tolerated and may result in disciplinary action.”
Graph-Based Text Analysis Report
(a) Explanation of Stop Words and Their Importance
Stop Words
Stop words are very common words in a language, such as "is," "and," "the," and "to," that are filtered out before text data is processed. They serve mainly a grammatical function and carry little topical meaning on their own, so they contribute little when extracting meaningful information from a text.
Importance in Keyword Extraction and Text Summarization
- Keyword Extraction: Identifying and removing stop words is crucial because these words do not contribute to the unique content of the text. Including them in the analysis could dilute the importance of truly significant words and lead to less accurate keyword extraction.
- Text Summarization: In extractive summarization, stop words are excluded when scoring sentences so that importance and similarity measures reflect the core content. If they were kept, two sentences could appear related merely because both contain words like "the" and "of," and the resulting summary would be less focused and less concise.
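The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the STOP_WORDS set and the remove_stop_words helper are hypothetical names introduced here, and the list is deliberately tiny (practical stop-word lists contain well over a hundred entries).

```python
import re

# Illustrative stop-word list (an assumption made for this sketch;
# practical lists are much longer).
STOP_WORDS = {"is", "and", "the", "to", "of", "a", "in", "by"}

def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Sample phrase built from the first tokens of the analyzed text.
sample = "Coffee is one of the most beloved beverages"
content_words = remove_stop_words(sample)
print(content_words)  # ['coffee', 'one', 'most', 'beloved', 'beverages']
```

Only the words that remain after filtering would be passed on to keyword extraction or sentence scoring.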
(b) TextRank Algorithm for Keyword Extraction
Process Description
- Preprocessing: Tokenize the text and remove stop words and punctuation.
- Graph Construction: Build a graph where each node represents a word, and edges represent co-occurrence within a fixed window of words.
- Edge Weights: Assign weights to edges based on the frequency and proximity of co-occurrence.
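- TextRank Calculation: Apply the TextRank algorithm, an adaptation of PageRank to text graphs, to calculate the importance of each word (node). Each node's score is repeatedly recomputed from the scores of its neighbors, typically with a damping factor of 0.85, until the scores change by less than a small threshold (convergence).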
- Keyword Extraction: Extract top-ranked words as keywords based on their TextRank scores.
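The five steps above can be sketched in plain Python as follows. This is a rough sketch under stated assumptions rather than a reference implementation: the function name textrank_keywords, the small STOP_WORDS set, the window size of 4, and the choice to weight each co-occurrence by 1/distance (so that both frequency and proximity count) are all illustrative choices, not details taken from the assignment.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption made for this sketch).
STOP_WORDS = {"is", "are", "the", "of", "and", "to", "from", "which",
              "a", "in", "by", "it", "was"}

def textrank_keywords(text, window=4, damping=0.85, tol=1e-6,
                      max_iter=100, top_k=5):
    # 1. Preprocessing: lowercase, tokenize, drop punctuation and stop words.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]

    # 2-3. Graph construction: undirected edges between words that co-occur
    # within `window` tokens. Each co-occurrence adds 1/distance to the edge
    # weight (assumed scheme combining frequency and proximity).
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            v = words[j]
            if v != w:
                weights[w][v] += 1.0 / (j - i)
                weights[v][w] += 1.0 / (j - i)
    if not weights:
        return []

    # 4. TextRank iteration (weighted PageRank) until scores stabilize.
    total = {w: sum(nbrs.values()) for w, nbrs in weights.items()}
    scores = {w: 1.0 for w in weights}
    for _ in range(max_iter):
        new_scores = {}
        for w in weights:
            rank = sum(weights[v][w] / total[v] * scores[v]
                       for v in weights[w])
            new_scores[w] = (1 - damping) + damping * rank
        delta = max(abs(new_scores[w] - scores[w]) for w in scores)
        scores = new_scores
        if delta < tol:
            break

    # 5. Keyword extraction: highest-scoring words first.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Sample built from tokens of the analyzed text (punctuation omitted).
sample = ("Coffee is made from roasted coffee beans which are the seeds of "
          "berries from the Coffea plant The two most common types of "
          "coffee beans are Arabica and Robusta")
keywords = textrank_keywords(sample)
for word, score in keywords:
    print(f"{word}: {score:.3f}")
```

Words that co-occur with many different neighbors, such as "coffee" in this sample, accumulate the highest scores. Changing the window size or the weighting scheme can change the ranking, so both are worth treating as tunable parameters.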
(c) Tokenization and Part-of-Speech Tagging Analysis
Tokenization
- Tokens: "Coffee," "is," "one," "of," "the," "most," "beloved," "beverages," "worldwide," "enjoyed," "by," "millions," "of," "people," "every," "day," "The," "history," "of," "coffee," "dates," "back," "to," "the," "15th," "century," "when," "it," "was," "first," "discovered," "in," "Ethiopia," "From," "there," "it," "spread," "to," "the," "Arabian," "Peninsula," "and," "eventually," "to," "Europe," "and," "the," "Americas," "Coffee," "is," "made," "from," "roasted," "coffee," "beans," "which," "are," "the," "seeds," "of," "berries," "from," "the," "Coffea," "plant," "The," "two," "most," "common," "types," "of," "coffee," "beans," "are," "Arabica," "and," "Robusta," "each," "offering," "distinct," "flavors," "and," "aromas," "Coffee," "not," "only," "provides," "a," "caffeine," "boost," "but," "is," "also," "rich," "in," "antioxidants," "which," "can," "help," "protect," "against," "various," "diseases," "The," "coffee," "industry," "is," "a," "major," "economic," "driver," "with," "countries," "like," "Brazil," "Vietnam," "and," "Colombia," "being," "the," "top," "producers," "Coffee," "culture," "has," "evolved," "significantly," "with," "numerous," "brewing," "methods," "and," "a," "variety," "of," "specialty," "coffee," "drinks," "available," "today."
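A token list like the one above could be produced with a simple regular-expression tokenizer. The sketch below is one possible approach, not necessarily the tool used for this report: the tokenize helper is a hypothetical name, and the sample sentence fragment is reconstructed from the token list, with its trailing comma being an assumption.

```python
import re

def tokenize(text):
    """Return word tokens in their original case, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Fragment reconstructed from the token list; the comma is assumed.
sample = "The history of coffee dates back to the 15th century,"
tokens = tokenize(sample)
print(tokens)
```

Case is preserved here, which is why "The" and "the" appear as separate tokens in the list.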
Part-of-Speech (POS) Tagging