Creating a knowledge graph from unstructured data is a well-known machine learning problem. No organization has yet achieved 100% accuracy for a completely enriched knowledge graph. I have a few findings that will help kick-start someone who is new to this.
Before moving to the findings, let me walk you through the problem of building a knowledge graph from an unstructured corpus. Consider this scenario: suppose we have a very small corpus:
"Apple was founded by Steve jobs and current CEO is Tim Cook. Apple launched several products like Ipad, iphone , MAC etc. "
A real corpus may contain far more complex sentences. The problem is how to build a knowledge graph out of this unstructured corpus. If we create a generic knowledge graph, our system should be able to answer questions like "Who founded Apple?", "What products were launched by Apple?", etc.
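To make the goal concrete, here is a minimal sketch of the target output: a knowledge graph reduced to (subject, relation, object) triples. The relation names are my own illustrative choices, not a standard vocabulary:

```python
# Hypothetical target triples for the toy corpus above.
triples = [
    ("Apple", "founded_by", "Steve Jobs"),
    ("Apple", "has_CEO", "Tim Cook"),
    ("Apple", "launched", "iPad"),
    ("Apple", "launched", "iPhone"),
    ("Apple", "launched", "Mac"),
]

# "Who founded Apple?" then becomes a simple lookup over the graph:
founders = [o for s, r, o in triples if s == "Apple" and r == "founded_by"]
print(founders)  # ['Steve Jobs']
```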
A few techniques to create a knowledge graph:
1.) Supervised Technique:
Supervised models used in the field of information extraction formulate the problem as a classification task: they learn a discriminative classifier given a set of positive and negative examples. Such approaches extract a set of features from the sentence, generally including context words, part-of-speech tags, the dependency path between the entities, edit distance, etc., and obtain the corresponding labels from a large labelled training corpus. A typical pipeline looks like this (a minimal NLTK sketch follows the list):
- Sentence Segmentation: takes the raw corpus as input and splits it into sentences, i.e. a list of strings.
- Tokenization: takes the list of sentences and converts each into tokens, i.e. a list of lists of strings.
- POS Tagging: converts the tokenized sentences into POS-tagged sentences, i.e. a list of lists of (word, tag) tuples.
- Entity detection: detects entities and chunks the sentences, producing a list of trees.
- Relation detection: classifies whether a particular relation holds for a given entity pair.
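A minimal sketch of the first four steps, assuming NLTK with its standard models downloaded (relation detection remains the classification step described below):

```python
import nltk

# One-time model downloads, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

raw = ("Apple was founded by Steve Jobs and its current CEO is Tim Cook. "
       "Apple launched several products like the iPad, iPhone, Mac, etc.")

# Sentence segmentation: raw corpus -> list of strings
sentences = nltk.sent_tokenize(raw)

# Tokenization: list of sentences -> list of lists of strings
tokens = [nltk.word_tokenize(s) for s in sentences]

# POS tagging: -> list of lists of (word, tag) tuples
tagged = [nltk.pos_tag(t) for t in tokens]

# Entity detection: -> list of trees whose subtrees are named-entity chunks
chunked = [nltk.ne_chunk(t) for t in tagged]

for tree in chunked:
    for subtree in tree.subtrees(lambda st: st.label() != "S"):
        print(subtree.label(), " ".join(w for w, _ in subtree.leaves()))
```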
For the relation detection step, we need:
- A set of relation types
- A named entity tagger
- Lots of labelled data (split into training, development, and test sets)
- A feature representation
- A classifier (Naïve Bayes, MaxEnt, SVM, …)
The features fall into three tiers (a sketch of the lightweight tier follows this list):
- Lightweight features – require little pre-processing
  - Words: headwords, bag of words, bigrams (between, before, or after the entities)
  - Entity type: PERSON, ORGANIZATION, FACILITY, LOCATION & Geo-Political Entity (GPE)
  - Entity level: NAME, NOMINAL & PRONOUN
- Medium-weight features – require base phrase chunking
  - Base phrase chunk paths
  - Bags of chunk heads
- Heavyweight features – require full syntactic parsing
  - Dependency tree paths between entities
  - Parse tree paths between entities
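Here is a minimal sketch of lightweight feature extraction; the (start, end, type) span format and the feature names are my own assumptions for illustration, not a standard API:

```python
def lightweight_features(tokens, e1, e2):
    # tokens: list of words in one sentence.
    # e1, e2: (start, end, entity_type) spans, with e1 before e2.
    s1, t1, type1 = e1
    s2, t2, type2 = e2
    between = tokens[t1:s2]                   # words between the two mentions
    features = {
        "head1": tokens[t1 - 1],              # headword of the first mention
        "head2": tokens[t2 - 1],              # headword of the second mention
        "types": f"{type1}-{type2}",          # e.g. PERSON-ORGANIZATION
    }
    for w in between:                         # bag of words between
        features[f"between={w.lower()}"] = 1
    for w1, w2 in zip(between, between[1:]):  # bigrams between
        features[f"bigram={w1.lower()}_{w2.lower()}"] = 1
    return features

# "Steve Jobs founded Apple" with a PERSON and an ORGANIZATION mention:
toks = ["Steve", "Jobs", "founded", "Apple"]
print(lightweight_features(toks, (0, 2, "PERSON"), (3, 4, "ORGANIZATION")))
```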
Pros:
- Can be adapted to a different domain
- High accuracy, given enough hand-labelled training data and test data similar enough to the training data
Cons:
- A large training set has to be labelled (expensive)
- Does not generalize well to different genres
- Extension to higher-order entity relations is difficult as well
2.) Semi-Supervised Technique:
One popular algorithm of this kind is the Snowball algorithm.
Snowball algorithm:
1.) Start with a seed set R of tuples.
2.) Generate a set P of patterns from R. Compute support and confidence for each pattern in P and discard the patterns with low support or confidence.
3.) Generate a new set T of tuples matching the patterns P. Compute the confidence of each tuple in T, and add to R the tuples t in T with conf(t) > threshold.
4.) Go back to step 2.
In other words: we start with seed examples, grab the patterns extracted around their occurrences, and then use those patterns to scan the collection and generate new seed tuples. In general, a pattern is a 5-tuple of the form (left, tag1, mid, tag2, right). An initial seed tuple is of the form (tag1, tag2, tag3, tag4, etc.). Example: for (organization, product, location), a seed example may be (Apple, iPad, California) or (IBM, DB2, Armonk).
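Below is a minimal sketch of one bootstrapping iteration. It uses plain string contexts as patterns instead of Snowball's vector-space patterns and omits the support/confidence scoring, so it illustrates the loop rather than the full algorithm; the corpus and seeds are toy examples:

```python
import re

corpus = [
    "Apple launched the iPad last year.",
    "IBM launched the DB2 database.",
    "Nikon launched the D850 camera.",
]

# Simplified two-slot seed tuples of the (organization, product) form.
seeds = {("Apple", "iPad")}

def extract_patterns(corpus, seeds):
    # For each seed occurrence, keep the middle context as the pattern.
    patterns = set()
    for org, prod in seeds:
        for sent in corpus:
            m = re.search(rf"{re.escape(org)}\s+(.*?)\s+{re.escape(prod)}", sent)
            if m:
                patterns.add(m.group(1))  # e.g. "launched the"
    return patterns

def generate_tuples(corpus, patterns):
    # Scan the collection with each pattern to propose new seed tuples.
    found = set()
    for mid in patterns:
        regex = rf"(\w+)\s+{re.escape(mid)}\s+(\w+)"
        for sent in corpus:
            found.update(re.findall(regex, sent))
    return found

# One iteration; Snowball repeats this, scoring each pattern and tuple
# and discarding those with low support or confidence.
patterns = extract_patterns(corpus, seeds)
seeds |= generate_tuples(corpus, patterns)
print(seeds)  # {('Apple', 'iPad'), ('IBM', 'DB2'), ('Nikon', 'D850')}
```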
Pros:
- Avoids manually labelling lots of data
Cons:
- Requires seeds for each relation (the quality of the original set of seeds is important)
- Semantic drift is a big problem at each iteration
- Precision is not high
3.) Distant Supervision Approach
This approach uses a database of relations, such as Freebase, to get lots of training examples. In the training phase, we build feature vectors for an "unrelated" relation by randomly selecting entity pairs that do not appear in any Freebase relation and extracting features for them. We use a multi-class logistic classifier optimized using L-BFGS with Gaussian regularization. The classifier takes as input an entity pair and a feature vector, and returns a relation name and a confidence score based on the probability of the entity pair belonging to that relation. Once all of the entity pairs discovered during testing have been classified, they can be ranked by confidence score and used to generate a list of the n most likely new relation instances.
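A minimal sketch of the classification stage, using scikit-learn's LogisticRegression; the feature dicts and labels below are toy stand-ins for what Freebase alignment would actually produce:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature dict per entity pair. In the real setup
# the labels come from aligning entity pairs with Freebase relations;
# pairs found in no relation are labeled 'unrelated'.
train_features = [
    {"mid=founded by": 1, "types=ORG-PER": 1},
    {"mid=was started by": 1, "types=ORG-PER": 1},
    {"mid=launched the": 1, "types=ORG-PROD": 1},
    {"mid=met with": 1, "types=PER-PER": 1},
]
train_labels = ["founded_by", "founded_by", "launched", "unrelated"]

vec = DictVectorizer()
X = vec.fit_transform(train_features)

# The L-BFGS solver with an L2 penalty approximates the L-BFGS +
# Gaussian regularization training described above.
clf = LogisticRegression(solver="lbfgs", penalty="l2", max_iter=1000)
clf.fit(X, train_labels)

# Classify a new entity pair; the class probability is the confidence
# score used to rank candidate relation instances.
test = vec.transform([{"mid=founded by": 1, "types=ORG-PER": 1}])
probs = clf.predict_proba(test)[0]
print(max(zip(clf.classes_, probs), key=lambda pair: pair[1]))
```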
Pros:
- Leverages unlimited amounts of text data
- Allows for a very large number of weak features
- Not sensitive to the training corpus: genre-independent