Build Knowledge Graph from unstructured corpus using Machine Learning



The problem of creating a knowledge graph from unstructured data is a well-known machine learning problem. No organization has yet achieved 100% accuracy on a completely enriched knowledge graph. I have a few findings that will help kick-start anyone who is new to this.

Before moving to the findings, let me walk you through the problem of building a knowledge graph from an unstructured corpus. Consider this scenario. Suppose we have a very small corpus:

"Apple was founded by Steve jobs and current CEO is Tim Cook. Apple launched several products like Ipad, iphone , MAC etc. "

A real corpus may contain far more complex sentences. The problem is how to build a knowledge graph out of this unstructured corpus. If we create a generic knowledge graph, then our system should be able to answer questions like "Who founded Apple?", "What products were launched by Apple?", etc.
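To make the target concrete, here is a minimal Python sketch of such a graph as subject-predicate-object triples. The triples are hand-written here as a stand-in for what an extraction pipeline would produce:

# A minimal sketch of the target knowledge graph for the toy corpus:
# subject-predicate-object triples, hand-written here for illustration.
triples = [
    ("Apple", "founded_by", "Steve Jobs"),
    ("Apple", "ceo",        "Tim Cook"),
    ("Apple", "launched",   "iPad"),
    ("Apple", "launched",   "iPhone"),
    ("Apple", "launched",   "Mac"),
]

def query(subject, predicate):
    # Answer questions like "Who founded Apple?" by simple triple lookup.
    return [o for s, p, o in triples if s == subject and p == predicate]

print(query("Apple", "founded_by"))  # ['Steve Jobs']
print(query("Apple", "launched"))    # ['iPad', 'iPhone', 'Mac']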

A few techniques to create a knowledge graph:

1.) Supervised Technique:


Supervised models used in the field of information extraction formulate the problem as a classification problem and generally learn a discriminative classifier given a set of positive and negative examples. Such approaches extract a set of features from the sentence, which generally include context words, part-of-speech tags, the dependency path between entities, edit distance, etc., and the corresponding labels are obtained from a large labelled training corpus.


  • Sentence Segmentation: takes the raw corpus as input and splits it into multiple sentences, i.e., a list of strings.
  • Tokenization: takes the list of sentences and converts each into tokens, i.e., a list of lists of strings.
  • POS Tagging: converts the tokens into POS-tagged sentences, i.e., a list of lists of tuples.
  • Entity detection: detects entities and chunks the sentences, producing a list of trees (see the sketch after this list).
  • Relation detection: classifies whether a particular relation holds for the given entity set.
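Here is a minimal sketch of the first four steps using NLTK, assuming the standard models have been downloaded (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words):

# A minimal sketch of the preprocessing pipeline above using NLTK.
# Assumes the standard models have been fetched first, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#   nltk.download('maxent_ne_chunker'); nltk.download('words')
import nltk

corpus = ("Apple was founded by Steve Jobs and the current CEO is Tim Cook. "
          "Apple launched several products like the iPad, iPhone, Mac, etc.")

sentences = nltk.sent_tokenize(corpus)               # list of strings
tokens = [nltk.word_tokenize(s) for s in sentences]  # list of lists of strings
tagged = [nltk.pos_tag(t) for t in tokens]           # list of lists of (word, POS) tuples
chunks = [nltk.ne_chunk(t) for t in tagged]          # list of trees with entity chunks

for tree in chunks:
    print(tree)  # entities appear as subtrees labelled PERSON, ORGANIZATION, ...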
Here are a few more points about what the supervised approach needs, along with its pros and cons.
  • A set of relation types.
  • A named entity tagger.
  • Lots of labelled data (split into a training set, development set, and test set).
  • A feature representation.
  • A classifier (Naïve Bayes, MaxEnt, SVM, …).
Here are all the features that we can use in the supervised approach (a small classifier sketch using the lightweight features follows this list):
  • Lightweight features – require little pre-processing
    • Words: headwords, bag of words, bigrams (between, before or after)
    • Entity type: PERSON, ORGANIZATION, FACILITY, LOCATION & Geo-Political Entity (GPE)
    • Entity level: NAME, NOMINAL & PRONOUN
  • Medium-weight features – require base phrase chunking
    • Base phrase chunk paths
    • Bags of chunk heads
  • Heavyweight features – require full syntactic parsing
    • Dependency tree paths between entities
    • Parse tree paths between entities
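As a hedged sketch of the classification step, the following trains a linear SVM on lightweight features (a bag of words between the two entity mentions). The tiny training set here is illustrative only, not a real labelled corpus:

# A hedged sketch of supervised relation classification with lightweight
# features; the data is illustrative, not a real hand-labelled corpus.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def lightweight_features(sentence, e1, e2):
    # Bag of words between the two entity mentions, plus the first entity.
    between = sentence.split(e1)[-1].split(e2)[0].strip().lower().split()
    feats = {"bow=" + w: 1 for w in between}
    feats["first_entity"] = e1
    return feats

# Tiny illustrative training set: (sentence, entity1, entity2, relation).
train = [
    ("Apple was founded by Steve Jobs", "Apple", "Steve Jobs", "founded_by"),
    ("Apple launched the iPad", "Apple", "iPad", "launched"),
]

X = [lightweight_features(s, e1, e2) for s, e1, e2, _ in train]
y = [rel for _, _, _, rel in train]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

# Classify a new entity pair.
print(model.predict([lightweight_features("IBM launched DB2", "IBM", "DB2")]))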
Pros:
  • Can be adapted to a different domain.
  • High accuracy, given enough hand-labelled training data and test data similar enough to the training data.
Cons:
  • Have to label a large training set (expensive).
  • Does not generalize well to different genres.
  • Extension to higher-order entity relations is difficult as well.

2.) Semi-Supervised Technique:

Such methods start with some known relation triples (seeds) and then iterate through the text to extract patterns that match the seed triples. These patterns are used to extract more relations from the data set, and the learned relations are then added to the seed examples. This process is repeated until no more relations can be learned from the data set.



One popular algorithm of this kind is the Snowball algorithm.

Snowball algorithm:
1.) Start with a seed set R of tuples.
2.) Generate a set P of patterns from R. Compute support and confidence for each pattern in P, and discard the patterns with low support or confidence.
3.) Generate a new set T of tuples matching the patterns P. Compute the confidence of each tuple in T, and add to R the tuples t in T with conf(t) > threshold.
4.) Go back to step 2.

Further illustration of the algorithm:

1.) Start with seed examples.

2.) Use an entity tagger to tag entities.
3.) Grab the extracted pattern.
In general, a pattern is a 5-tuple of the form (left, tag1, mid, tag2, right), where tag1 and tag2 are named entity tags, and left, mid, and right are vectors of weighted terms.



4.) Cluster the patterns, and filter the patterns in each cluster by computing support and confidence.
5.) Using the patterns, scan the collection to generate new seed tuples.

The initial seed tuple will be of the form (tag1, tag2, tag3, etc.).
Example: for (organization, product, location), a seed example may be (Apple, iPad, California) or (IBM, DB2, Armonk), etc.
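Below is a toy sketch of one Snowball-style iteration over a two-sentence corpus. Real Snowball represents left, mid, and right as weighted term vectors and matches patterns by vector similarity with support/confidence filtering; this sketch uses exact string contexts (and a crude " in" right boundary) purely for illustration:

# A toy sketch of one Snowball-style iteration; the corpus, seed tuple,
# and exact-string pattern matching are all illustrative simplifications.
import re

seeds = {("Apple", "Steve Jobs")}  # (ORGANIZATION, PERSON) seed tuples

corpus = [
    "Apple was founded by Steve Jobs in California.",
    "Microsoft was founded by Bill Gates in Albuquerque.",
]

# Step 2: extract the 'mid' context around each seed occurrence as a pattern.
patterns = set()
for org, person in seeds:
    for sent in corpus:
        match = re.search(re.escape(org) + r"(.*?)" + re.escape(person), sent)
        if match:
            patterns.add(match.group(1))  # e.g. " was founded by "

# Step 3: scan the collection with the learned patterns to find new tuples.
for mid in patterns:
    for sent in corpus:
        match = re.search(r"(\w+)" + re.escape(mid) + r"([\w ]+?) in", sent)
        if match:
            seeds.add((match.group(1), match.group(2)))

print(seeds)  # now also contains ('Microsoft', 'Bill Gates')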

Pros:
  • Avoids manually labelling lots of data.

Cons:
  • Requires seeds for each relation (the quality of the original seed set is important).
  • Big problem of semantic drift at each iteration.
  • Precision is not high.

3.) Distant Supervision Approach:

It uses a database of relations, such as Freebase, to get lots of training examples. In the training phase, we build a feature vector for the ‘unrelated’ relation by randomly selecting entity pairs that do not appear in any Freebase relation and extracting features for them.
We use a multi-class logistic classifier optimized using L-BFGS with Gaussian regularization. The classifier takes as input an entity pair and a feature vector, and returns a relation name and a confidence score based on the probability of the entity pair belonging to that relation. Once all of the entity pairs discovered during testing have been classified, they can be ranked by confidence score and used to generate a list of the n most likely new relation instances.
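Here is a hedged sketch of the distant-supervision labelling step itself: any sentence mentioning both entities of a known triple becomes a positive example for that relation, and pairs found in no triple become ‘unrelated’ negatives. The knowledge base and sentences below are illustrative stand-ins, not real Freebase data:

# A hedged sketch of distant-supervision labelling with an illustrative
# mini knowledge base; real systems would align against Freebase-scale data.
kb = {
    ("Apple", "Steve Jobs"): "founded_by",
    ("Apple", "Tim Cook"): "ceo",
}

sentences = [
    "Apple was founded by Steve Jobs.",
    "Tim Cook runs Apple today.",
    "Steve Jobs met Tim Cook.",
]

entities = ["Apple", "Steve Jobs", "Tim Cook"]  # assumed output of an entity tagger

training = []
for sent in sentences:
    mentioned = [e for e in entities if e in sent]
    for i, e1 in enumerate(mentioned):
        for e2 in mentioned[i + 1:]:
            # Pairs present in the KB become positives; the rest are negatives.
            label = kb.get((e1, e2)) or kb.get((e2, e1)) or "unrelated"
            training.append((sent, e1, e2, label))

for example in training:
    print(example)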


Pros:
  • Leverages unlimited amounts of text data.
  • Allows for a very large number of weak features.
  • Not sensitive to the training corpus: genre-independent.

