Unified file structure

From MelaTAMP

The project’s git repository follows the unified file structure outlined below.

Naming conventions

The conventions for naming corpus files and directories in the git repository are as follows.

  • All names are in lowercase
  • No spaces; use dashes (-) instead
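
The two rules above could be checked mechanically. The following is a minimal sketch, not part of the project's tooling; the function name follows_naming_conventions is hypothetical and chosen here for illustration only.

```python
def follows_naming_conventions(name: str) -> bool:
    """Return True if `name` is all lowercase and contains no whitespace.

    Hypothetical helper: it only encodes the two rules listed above
    (lowercase names, dashes instead of spaces).
    """
    has_whitespace = any(ch.isspace() for ch in name)
    return name == name.lower() and not has_whitespace


if __name__ == "__main__":
    for candidate in ["read-only-user-files", "Read-Only", "user files"]:
        print(candidate, follows_naming_conventions(candidate))
```

Such a check could be run over all paths below corpora before committing, assuming the conventions are meant to apply to every corpus file and directory.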

Directory structure

git root directory
├── corpora (1)
│   └── [language-name] (2)
│       ├── [format-name] (3)
│       │   └── [format-file] (4)
│       ├── original (5)
│       │   └── [original-files]
│       ├── read-only-user-files (6)
│       │   └── [files]
│       └── [version-directory] (7)
│           ├── [format-name]
│           │   └── [format-file]
│           ├── original
│           │   └── [original-files]
│           └── read-only-user-files
│               └── [files]
└── processing (8)
    ├── analysis (9)
    │   └── [analysis-directory]
    │       └── [files]
    └── [processing-type] (10)
        └── [files]

  1. corpora holds the primary data, i.e., the corpus data.
  2. One directory per language, named with the name of the language.
  3. One directory per format, i.e., the format in which the corpus has been persisted.
  4. Corpus files in the respective format, see “Corpus file names” below.
  5. original holds the original primary data as acquired from data providers (corpus “authors”, etc.).
  6. read-only-user-files holds files that have been created for specific use cases other than conversion. Note that these files must be read-only, i.e., they will never be converted, published, or referenced.
  7. In cases where the corpus has been versioned, each version is completely contained in a version directory, which is a child of the language directory. Version directories should be named v{version number}, e.g., v2. A readme file (in Markdown format and named README.md) should be included in the version directory to document the reasons for creating the respective version. Version directories replicate the directory structure of the language directory. If version directories exist for a language, the language directory’s only children must be the version directories.
  8. processing holds files and directories for analysis results and data processing.
  9. analysis holds directories which are named after the analysis performed on the corpus data, e.g., genre-synopsis-extraction.
  10. Processing directories should be direct children of processing. Their names combine the stage at which the processing is performed (the prefix pre or post), the processing step they refer to, and the processing outcome. E.g., pre-conversion-to-annis may contain files for processing steps that can be performed before a format conversion to the ANNIS format.
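
Notes 7 and 10 describe rules that lend themselves to a simple automated check. The sketch below is illustrative only: the function names are hypothetical, and it assumes that version numbers are plain integers (as in v2), which the text above does not state explicitly.

```python
import re

# Assumption: version numbers are plain integers, e.g. "v2".
VERSION_DIR = re.compile(r"v\d+")


def is_version_dir(name: str) -> bool:
    """True if `name` looks like a version directory such as 'v2'."""
    return VERSION_DIR.fullmatch(name) is not None


def is_processing_dir(name: str) -> bool:
    """True if `name` carries the 'pre' or 'post' stage prefix (note 10)."""
    return name.startswith(("pre-", "post-"))


def language_dir_children_ok(children: list[str]) -> bool:
    """Check note 7: if any version directory exists, all children must be
    version directories."""
    versions = [c for c in children if is_version_dir(c)]
    return not versions or len(versions) == len(children)


if __name__ == "__main__":
    print(is_version_dir("v2"))
    print(is_processing_dir("pre-conversion-to-annis"))
    print(language_dir_children_ok(["v1", "original"]))
```

The last call illustrates the mixed case the rule forbids: a language directory that contains both a version directory and an unversioned original directory.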

Corpus file names

Files containing primary corpus data should be named as follows. Note that if there is only one corpus data file per language, the original ID must be replaced with corpus to document that the file holds the complete corpus (as opposed to one or more corpus documents).

<language-name>-<format>-<orig-id | corpus>.<file-extension>

E.g., a file containing a document called interview1 from a Mavea corpus in the Toolbox text format would be named mavea-toolbox-interview1.txt. A file containing a complete corpus from Daakie in the FLExText format would be named daakie-flextext-corpus.flextext.
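
The naming pattern can be expressed as a small helper. This is a sketch under the assumption that the parts are already lowercase and free of spaces; the function name corpus_file_name and its parameter names are hypothetical.

```python
from typing import Optional


def corpus_file_name(language: str, fmt: str, extension: str,
                     orig_id: Optional[str] = None) -> str:
    """Build a corpus file name following the pattern above.

    If no original ID is given, the literal 'corpus' is used to mark
    the file as holding the complete corpus.
    """
    doc_part = orig_id if orig_id is not None else "corpus"
    return f"{language}-{fmt}-{doc_part}.{extension}"


if __name__ == "__main__":
    # The two examples from the text above.
    print(corpus_file_name("mavea", "toolbox", "txt", "interview1"))
    print(corpus_file_name("daakie", "flextext", "flextext"))
```

Called with the values from the two examples, the helper yields mavea-toolbox-interview1.txt and daakie-flextext-corpus.flextext.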