GitHub - dkorenci/doc-topic-coherence: Code of the experiments from the article "Document-based Topic Coherence Measures for News Media Text"


This package contains the code for the experiments from the article
"Document-based Topic Coherence Measures for News Media Text"
https://doi.org/10.1016/j.eswa.2018.07.063
share link: https://authors.elsevier.com/a/1XWV13PiGT7grp
If you use the code for scientific research, please cite the original article. 

All the code is licensed under the GNU Lesser General Public License,
except for the modified Palmetto code and .jar in the 'palmetto' folder,
which are licensed under the GNU Affero General Public License.

This package contains original code and code derived (with minor source modifications)
from two other projects: Palmetto ('palmetto' folder) and Gensim ('gensim_mod' folder).

The companion dataset can be found at: https://github.com/dkorenci/doc-topic-coherence-data

*** Setting up the environment ***

This is Python 2.7 code. It should work with the latest versions of the
numpy/scipy ecosystem and the other required libraries.
Gensim 0.12.4 is recommended, as the serialized topic models were built with this version.
In order to run the experiments with word coherence measures, the Palmetto jar
has to be wrapped as a Python package via the JCC tool.
See the 'palmetto' folder for more information.
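
As a quick sanity check of the requirements above, the following minimal sketch
compares the running interpreter and the installed Gensim against the versions
named in this readme. The function name check_environment and the warning texts
are illustrative assumptions and are not part of this package.

```python
import sys

# Versions taken from this readme.
REQUIRED_PYTHON = (2, 7)
RECOMMENDED_GENSIM = "0.12.4"


def check_environment(python_version, gensim_version):
    """Return a list of warning strings (empty if everything matches).

    Hypothetical helper, not part of the package. gensim_version may be
    None when Gensim is not installed.
    """
    warnings = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        warnings.append("Python 2.7 is expected, found %d.%d"
                        % (python_version[0], python_version[1]))
    if gensim_version is None:
        warnings.append("Gensim is not installed")
    elif gensim_version != RECOMMENDED_GENSIM:
        warnings.append("Gensim %s is recommended, found %s"
                        % (RECOMMENDED_GENSIM, gensim_version))
    return warnings


if __name__ == "__main__":
    try:
        import gensim
        found = gensim.__version__
    except ImportError:
        found = None
    for message in check_environment(sys.version_info, found):
        print(message)
```

A Gensim version mismatch only produces a warning here, because the readme
recommends 0.12.4 rather than requiring it.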

Alternatively, you can build a Docker image using the code in the 'docker' folder,
or download a prebuilt Docker image from: https://rebrand.ly/doc-coh-docker

To run the experiments, the resources contained in the dataset package and the
resources linked from this package have to be downloaded and the variables in
the settings module have to point to the location of the resources.

*** Code pointers ***

The code for most of the experiments can be found in the 'doc_topic_coh.coherence' package:
The 'measure_evaluation' package contains the model selection,
as well as the quantitative experiments from Section 4 and Section 5 of the article.
The 'qualitative_analysis' module contains the experiments from Section 5.3.
The 'experiment' module contains the helper class for running the experiments.
The 'coherence_builder' module contains the factory for coherence functions.

The 'doc_topic_coh.topic_discovery' package contains experiments from Section 6.

Code for creation, saving and loading of the topic dataset is contained within the 'dataset' package.

*** Applying the document coherence measures to new datasets ***

First, create the basic resources for your experiment:
corpus, dictionary, text2tokens converter, topic model.
You can modify the existing classes or adapt your own classes
to the interfaces defined by the corresponding classes in the pytopia package.
Then, create the context (a map of ids -> resource objects) with the resources.
Finally, use the CoherenceFunctionBuilder to create coherence functions from the parameters.
All the intermediate resources, such as documents indexed with document-topic
proportions, will be created automatically if the objects adhere to the interfaces.
See resources.pytopia_context and its uses for more information.
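
To illustrate the idea of a context as a map of ids -> resource objects, here is
a minimal plain-Python sketch. It does NOT use the pytopia API: SimpleContext,
whitespace_tokenizer and the ids "toy_corpus" and "toy_text2tokens" are all
hypothetical names chosen for this example. The real interfaces are defined by
the classes in the pytopia package.

```python
class SimpleContext(object):
    """Hypothetical stand-in for a context: a map of ids -> resource objects."""

    def __init__(self):
        self._resources = {}

    def add(self, resource_id, resource):
        # Refuse duplicate ids so that one id always names one resource.
        if resource_id in self._resources:
            raise KeyError("duplicate resource id: %s" % resource_id)
        self._resources[resource_id] = resource

    def __getitem__(self, resource_id):
        return self._resources[resource_id]

    def __contains__(self, resource_id):
        return resource_id in self._resources


def whitespace_tokenizer(text):
    # Toy text2tokens converter: lowercase the text and split on whitespace.
    return text.lower().split()


# Toy resources registered under hypothetical ids.
corpus = ["Topic models summarize news", "News media text"]
ctx = SimpleContext()
ctx.add("toy_corpus", corpus)
ctx.add("toy_text2tokens", whitespace_tokenizer)

# Other components can now look resources up by id instead of by reference.
tokens = [ctx["toy_text2tokens"](doc) for doc in ctx["toy_corpus"]]
```

Referring to resources by id is what lets dependent resources be located and
built on demand once every object exposes an id.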

*** Dataset ***

Code for creation, saving and loading of the dataset is contained within the 'dataset' package.

The dataset of labeled topics is created by loading model topics and labeling them.
Model topics are loaded from serialized topic models from the context.
Topic labels are created from model topic annotations in two steps.
The first step creates topic "features" from the topic descriptions and the table of semantic topics.
XML files residing in the topic model folders contain textual descriptions of topics.
Semantic topic tables are xlsx tables that link model topics to semantic topics (concepts).
Descriptions and semantic topic tables are products of the model topic annotation process.
Topic features contain data relevant for topic categorization. These features
are somewhat dataset dependent, as slightly different annotation conventions were used.
The second step, labeling, assigns each model topic, based on its features, to one of five classes:
theme (one semantic topic), theme_noise (one semantic topic + noise),
theme_mix (two or more semantic topics), theme_mix_noise, noise.
Note that labeling in the context of the topic dataset refers to labeling with
these five more specific classes, not 0/1 coherence labels.
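
The five classes can be sketched as a function of two simplified features.
This is an illustration only: the function label_topic and its two parameters
are hypothetical, the actual topic features in the 'dataset' package are richer
and dataset dependent, and treating zero semantic topics as 'noise' is an
assumption made for this example.

```python
def label_topic(num_semantic_topics, has_noise):
    """Map simplified topic features to one of the five topic classes.

    Hypothetical sketch, not the package's labeling code.
    num_semantic_topics: number of semantic topics linked to the model topic.
    has_noise: True if the model topic also contains noise.
    """
    if num_semantic_topics < 0:
        raise ValueError("num_semantic_topics must be non-negative")
    if num_semantic_topics == 0:
        # Assumption: a topic with no semantic topic is pure noise.
        return "noise"
    base = "theme" if num_semantic_topics == 1 else "theme_mix"
    return base + "_noise" if has_noise else base
```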
Coherence labels are created upon running the experiments by
converting 'theme' and 'theme_noise' to 1 (coherent) and all other labels to 0 (incoherent).
The details of the process are described in Section 4.1 of the article.
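
The conversion from topic classes to 0/1 coherence labels can be sketched as
follows. The names COHERENT_LABELS, ALL_LABELS and coherence_label are
illustrative and are not taken from the package.

```python
# The five topic classes, and the two that count as coherent.
ALL_LABELS = frozenset(["theme", "theme_noise", "theme_mix",
                        "theme_mix_noise", "noise"])
COHERENT_LABELS = frozenset(["theme", "theme_noise"])


def coherence_label(topic_label):
    """Return 1 (coherent) for 'theme' and 'theme_noise', 0 otherwise."""
    if topic_label not in ALL_LABELS:
        raise ValueError("unknown topic label: %r" % (topic_label,))
    return 1 if topic_label in COHERENT_LABELS else 0


labels = ["theme", "theme_mix", "noise", "theme_noise", "theme_mix_noise"]
binary = [coherence_label(label) for label in labels]
```

Rejecting unknown labels, rather than silently mapping them to 0, keeps a typo
in a label from being counted as an incoherent topic.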
