Processing text corpora with NLTK

From Westslawische Sprachen

Current revision as of 14 May 2017, 00:18

Here is an elegant way to process your text data with professional computational linguistics (CL) tools: import it into NLTK (http://nltk.org), the Natural Language Toolkit for Python3, and use the processing tools that NLTK provides.

Let us assume you have Python3 and the NLTK installed (if you haven't, follow the instructions in How to install Python3 and the NLTK). Furthermore, let us assume that your corpus texts are UTF-8 encoded and stored in the folder `/Users/roland/mycorpus`. Then open a Python3 shell (IDLE, command line or other) and type:


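> import nltk
> from nltk.corpus import PlaintextCorpusReader
> root = "/Users/roland/mycorpus" # folder that contains the corpus files
> mycorpus = PlaintextCorpusReader(root, '.*') # mycorpus is a PlaintextCorpusReader over all files in root
> mycorpusText = nltk.Text(mycorpus.words()) # mycorpusText is an nltk.Text
> mycorpus.words() # show the tokens of the corpus
> mycorpusText.concordance("кое-какое") # print the occurrences of the word in context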


(Note: > is the command prompt. Please do not type it!)
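
One thing to check when searching for hyphenated forms such as "кое-какое": as far as we know, `PlaintextCorpusReader` tokenizes with NLTK's `WordPunctTokenizer` by default, which is built on the regular expression `\w+|[^\w\s]+` and therefore splits words at hyphens. Since `concordance` looks up tokens, a hyphenated form may then yield no matches. The sketch below uses only Python's standard `re` module to show the effect. The names `WORDPUNCT_PATTERN`, `HYPHEN_AWARE_PATTERN` and `tokenize` are illustrative and not part of NLTK, and the hyphen-aware pattern is just one possible choice.

```python
import re

# Pattern that, to our understanding, NLTK's WordPunctTokenizer is built on.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

# Hypothetical alternative: keep word-internal hyphens inside one token.
HYPHEN_AWARE_PATTERN = r"\w+(?:-\w+)*|[^\w\s]+"


def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text as tokens."""
    return re.findall(pattern, text)


sample = "кое-какое слово."

print(tokenize(WORDPUNCT_PATTERN, sample))
# ['кое', '-', 'какое', 'слово', '.']

print(tokenize(HYPHEN_AWARE_PATTERN, sample))
# ['кое-какое', 'слово', '.']
```

To use such a pattern in NLTK, it should be possible to pass a tokenizer to the reader, for example `PlaintextCorpusReader(root, '.*', word_tokenizer=RegexpTokenizer(HYPHEN_AWARE_PATTERN))` with `RegexpTokenizer` imported from `nltk.tokenize`. Check `mycorpus.words()` afterwards to confirm that the tokens look as expected.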