Fintech Key-Phrase

We propose a new Information Retrieval dataset, the Chinese Financial & High-tech Dataset (Fintech Key-Phrase), which is derived from publicly released Chinese Management’s Discussion and Analysis (CMD&A) documents. To the best of our knowledge, with more than 1.2K human-annotated instances, Fintech Key-Phrase is the largest reliable Chinese benchmark for the Expression-level Information Extraction task.

Dataset Statistics

We collected more than 16,600 High-tech CMD&A annual report documents, covering the annual business reports of up to 2,692 different companies. The documents contain about 11,171 words on average; the maximum and minimum document lengths are 32,006 and 115, respectively.

The figure below shows the distribution of document lengths and of document release times across intervals.

Train & Test Set Split

The training set contains more than 35,884 annotated Fin-tech domain key-phrases, of which 11,434 are distinct after removing duplicates. The test set contains more than 1,769 Fin-tech key-phrases, of which 1,439 are distinct after removing duplicates.
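The relation between the total and the distinct key-phrase counts can be illustrated with a minimal sketch. The function name `phrase_counts` and the sample phrases are hypothetical and are not taken from the dataset; the actual annotation file format is not described here, so the input is assumed to be a plain list of strings.

```python
from collections import Counter

def phrase_counts(phrases):
    """Return (total annotations, distinct key-phrases) for a list of strings."""
    # Empty or whitespace-only entries are ignored; this is an assumption.
    counts = Counter(p.strip() for p in phrases if p.strip())
    return sum(counts.values()), len(counts)

# Toy annotations invented for illustration (not real dataset entries).
sample = ["云计算", "区块链", "云计算", "大数据", "区块链", "云计算"]
total, distinct = phrase_counts(sample)
print(total, distinct)
```

Applied to the real splits, a count of this kind would correspond to the pairs reported above (total versus distinct key-phrases).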

The figure below shows the number of key-phrases in each length interval.

The statistics show that the majority of Fin-tech domain key-phrases have a length between 1 and 6, with only a small part of the key-phrases having a length of more than 7. This observation indicates that, in general, the key-phrases that financial experts are interested in are short and simple.
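A length distribution of this kind could be computed roughly as follows. This is a sketch under assumptions: the unit of length is not stated on this page, so the example measures length in characters, and the function name `length_histogram`, the overflow bucket, and the sample phrases are all hypothetical.

```python
from collections import Counter

def length_histogram(phrases, max_bucket=7):
    """Count phrases per length; lengths above max_bucket share one bucket."""
    hist = Counter()
    for phrase in phrases:
        n = len(phrase)  # assumption: length is measured in characters
        key = str(n) if n <= max_bucket else f">{max_bucket}"
        hist[key] += 1
    return dict(hist)

# Toy key-phrases invented for illustration (not real dataset entries).
sample = ["云计算", "区块链", "大数据", "人工智能", "移动互联网金融服务平台"]
print(length_histogram(sample))
```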

Real-time Model Prediction

  • BERT-Linear

  • BERT-CRF

  • BERT-BiLSTM-CRF

  • RoBERTa-Linear

  • RoBERTa-CRF

  • RoBERTa-BiLSTM-CRF
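The model names above pair an encoder (BERT or RoBERTa) with a Linear, CRF, or BiLSTM-CRF output layer, which is a common setup for sequence labeling. The tagging scheme used by these models is not stated on this page; the sketch below assumes a per-character B/I/O scheme and shows how predicted tags could be turned into key-phrases. The function name `decode_bio`, the sentence, and the tags are hypothetical.

```python
def decode_bio(chars, tags):
    """Collect key-phrase strings from per-character B/I/O tags (assumed scheme)."""
    phrases, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            # A new phrase starts; close any phrase that is still open.
            if current:
                phrases.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            current.append(ch)
        else:
            # "O", or a stray "I" with no open phrase, ends the current phrase.
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases

# Toy sentence and tags invented for illustration (not model output).
text = "公司布局云计算和区块链"
tags = ["O", "O", "O", "O", "B", "I", "I", "O", "B", "I", "I"]
print(decode_bio(text, tags))
```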

Released Application Programming Interface (API)

An introduction to the released API.


Hope you enjoy it!

  • 10 Max Threads
  • 10 GB Max Capacity
  • 2048 KB Max Bandwidth
  • 10 Max Connections
  • 498 Page Views

Conclusion and Perspective

1) We present a new dataset, named Chinese Financial & High-tech Based Key-Phrase (Fintech Key-Phrase), which can be regarded as the newest Expression-level Information Extraction benchmark in the Chinese Financial & High-tech domain.
2) We conduct comprehensive experiments with several SOTA approaches (the six models listed in the "Real-time Model Prediction" section). The experiments demonstrate that the results on our dataset can serve as solid baselines for future Information Extraction research.
3) On this website, we have released the trained SOTA models and the corresponding APIs for extracting key-phrases in the Chinese Financial & High-tech domain.