Creating PMML for a text classification pipeline in Python

I am trying to create PMML (using jpmml-sklearn) for a text classification pipeline. The last line of the code, sklearn2pmml(Textpipeline, "TextMiningClassifier.pmml", with_repr = True), crashes.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn2pmml import PMMLPipeline

categories = [
    'alt.atheism',
    'talk.religion.misc',
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))

Textpipeline = PMMLPipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

Textpipeline.fit(data.data, data.target)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(Textpipeline, "TextMiningClassifier.pmml", with_repr = True)

It looks like sklearn2pmml() cannot accept Textpipeline as input. The code works fine for other pipelines (examples here: https://github.com/jpmml/sklearn2pmml), but not for the text classification pipeline above. So my question is: how do I create PMML for a text classification task?

I get the following error:

Jun 15, 2017 12:48:00 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 489 ms.
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run
INFO: Converting..
Jun 15, 2017 12:48:01 PM sklearn2pmml.PMMLPipeline encodePMML
WARNING: The 'target_field' attribute is not set. Assuming y as the name of the target field
Jun 15, 2017 12:48:01 PM sklearn2pmml.PMMLPipeline initFeatures
WARNING: The 'active_fields' attribute is not set. Assuming [x1] as the names of active fields
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter
 at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:263)
 at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:164)
 at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:124)
 at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
 at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
 at org.jpmml.sklearn.Main.run(Main.java:144)
 at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter
 at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:263)
 at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:164)
 at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:124)
 at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
 at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
 at org.jpmml.sklearn.Main.run(Main.java:144)
 at org.jpmml.sklearn.Main.main(Main.java:93)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Data\Anaconda2\lib\site-packages\sklearn2pmml\__init__.py", line 142, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Asked by Nikhil Garge on 15.06.2017


Answers (1)


You need to use a PMML-compatible text tokenization function. The default implementation is the class sklearn2pmml.feature_extraction.text.Splitter:

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn2pmml.feature_extraction.text import Splitter
vectorizer = TfidfVectorizer(analyzer = "word", token_pattern = None, tokenizer = Splitter())
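
Applied to the pipeline from the question, the change might look like the sketch below. It assumes that a single TfidfVectorizer step can stand in for the CountVectorizer + TfidfTransformer pair (scikit-learn documents the two as equivalent) and that no vectorizer options beyond the ones shown above are needed, which may depend on the installed sklearn2pmml version. The helper names build_text_pipeline and export_text_classifier are hypothetical, introduced only for illustration.

```python
def build_text_pipeline():
    """Build an unfitted text classification pipeline with a PMML-compatible tokenizer."""
    # Imports are kept local so the sketch can be loaded without sklearn2pmml installed.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn2pmml import PMMLPipeline
    from sklearn2pmml.feature_extraction.text import Splitter

    return PMMLPipeline([
        # Assumption: TfidfVectorizer replaces the 'vect' + 'tfidf' steps of the question.
        # token_pattern is disabled and tokenization is delegated to Splitter.
        ("tfidf", TfidfVectorizer(analyzer="word", token_pattern=None, tokenizer=Splitter())),
        ("clf", SGDClassifier()),
    ])


def export_text_classifier(texts, labels, pmml_path):
    """Fit the pipeline on raw documents and write it to a PMML file (requires Java)."""
    from sklearn2pmml import sklearn2pmml

    pipeline = build_text_pipeline()
    pipeline.fit(texts, labels)
    sklearn2pmml(pipeline, pmml_path, with_repr=True)
    return pipeline
```

With the data loaded as in the question, the call would be export_text_classifier(data.data, data.target, "TextMiningClassifier.pmml").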

More details and references are available on the JPMML mailing list: https://groups.google.com/forum/#!topic/jpmml/wi-0rxzUn1o
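
Note that swapping the tokenizer can change the resulting vocabulary. The sketch below compares scikit-learn's default token_pattern with a plain whitespace split, using only the standard library. The whitespace regex is an assumption about how Splitter behaves by default, so check your installed version; the function names are hypothetical.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer:
# words of two or more characters, punctuation treated as a separator.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumed stand-in for Splitter: split on runs of whitespace.
WHITESPACE = re.compile(r"\s+")


def default_tokens(text):
    # The vectorizers lowercase the text by default before tokenizing.
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


def whitespace_tokens(text):
    # Drop empty strings produced by leading or trailing whitespace.
    return [token for token in WHITESPACE.split(text.lower()) if token]


sample = "Is a PMML export possible? Yes, it is."
print(default_tokens(sample))
print(whitespace_tokens(sample))
```

Under this assumption, punctuation stays attached to tokens and one-letter words are kept, so a model trained with the new tokenizer is not guaranteed to match one trained with the defaults.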

Answered by user1808924 on 15.06.2017