These are some of the projects I have authored or contributed to at my workplace. I cannot go into technical and implementation details about them because of company policies. However, I will give an overview of each and talk about the problem it tries to solve.
1. Mapping Technical Data to Business Data and Creating a Semantic Data Lake
In large organizations with offices spread across the globe, there is a gap between the technical team and the business team. As a result, any business requirement needs several rounds of back and forth between the two teams before enough data is presented in a format that supports the business decision.
This is just one of many problems that occur when unstructured data grows unchecked and is not tied to a single centralized glossary. The result is a great deal of redundancy and a lot of time lost before actual work can take place.
Bringing everything under a common structure is difficult because we are working with purely unstructured data.
That is where Natural Language Processing and Machine Learning come in. We use them to map the unstructured technical and business data from different sources and create a structured semantic data lake in a domain-specific and organization-specific way, which increases data integrity and helps with data maintenance.
2. Automated Domain-Specific Quality Check and Correction of Business Glossary at a Large Scale
A glossary is supposed to be sacrosanct. However, building a centralized glossary for a multinational organization is a mammoth task.
Owing to the large scale, the quality of the data is often compromised, and the huge task of reviewing all the data personally and making sure everything meets company standards falls on whoever is in charge: the project manager, the content manager or the CIO. Needless to say, that is a nontrivial task.
The Automatic Quality Checker runs domain-specific and organization-specific checks while the glossary is being written, such as Spelling Check, Acronyms Check, Abbreviations Check, Indicators Check and Attributes Check, and automatically suggests the best version of the glossary. This greatly reduces the work of both the person building the glossary and the person responsible for approving it.
3. Unsupervised Creation of Ontology from Purely Unstructured Data
In a world of data, an Ontology is perceived to be the absolute truth. Generally, an Ontology is handcrafted after a lot of research, which takes a long time and is limited to the knowledge of the people creating it.
We are able to create a domain-specific and organization-specific Ontology from purely unstructured data in a completely automated fashion.
Quality-wise, the automatically built Ontology is around 80% as good as one built by hand. However, it is far faster to build, highly scalable, and also recognizes connections through hidden structures in the data which are very difficult to find manually.
Moreover, from an organization's perspective, there is a provision for manual feedback to fine-tune the automatically created Ontology so that it best suits their standards.
4. Context-Based Aspect Extraction for Voice of Customer Analysis
A company needs to know what its clients think of the services it provides. In business terminology, this is known as the Voice of Customer. Along with this, the company also wants to understand which vertical of its business a client is talking about, and whether that customer is a promoter or a detractor.
Doing this manually is a laborious and highly resource-consuming task, where someone has to sit and go through copious amounts of feedback data.
So, we automated it so that an organization can understand, in a broad sense, the voice of customer for each aspect it is interested in.
For example, we can report what customers are saying about a restaurant, with the aspects being ambiance, food and drinks, service, staff behavior, value for money, etc. Against each aspect, the restaurant gets a normalized score based on what its patrons are saying.
The product gives businesses an opportunity to run different analytics and come up with a strategy that tackles each aspect individually.
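To make the idea of a normalized per-aspect score concrete, here is a toy sketch. It is not the product's implementation, which I cannot share. The keyword lists, the sentiment lexicon and the function name `score_aspects` are all hypothetical, and a plain keyword lookup stands in for the actual context-based extraction.

```python
import re
from collections import defaultdict

# Hypothetical aspect keywords and sentiment lexicon, for illustration only.
ASPECT_KEYWORDS = {
    "food and drinks": {"food", "pasta", "drinks", "coffee"},
    "service": {"service", "waiter", "staff"},
    "ambiance": {"ambiance", "music", "decor"},
}
POSITIVE = {"great", "good", "friendly", "lovely", "delicious"}
NEGATIVE = {"bad", "slow", "rude", "cold", "loud"}


def score_aspects(reviews):
    """Return a score in [0, 1] per aspect: the share of positive sentiment
    words among all sentiment words in sentences that mention the aspect."""
    counts = defaultdict(lambda: [0, 0])  # aspect -> [positive, negative]
    for review in reviews:
        for sentence in re.split(r"[.!?]+", review.lower()):
            tokens = set(re.findall(r"[a-z]+", sentence))
            pos = len(tokens & POSITIVE)
            neg = len(tokens & NEGATIVE)
            for aspect, keywords in ASPECT_KEYWORDS.items():
                if tokens & keywords:
                    counts[aspect][0] += pos
                    counts[aspect][1] += neg
    # Aspects with no sentiment words at all are left out of the result.
    return {
        aspect: pos / (pos + neg)
        for aspect, (pos, neg) in counts.items()
        if pos + neg > 0
    }


reviews = [
    "The food was delicious. The service was slow.",
    "Great pasta! The waiter was rude.",
    "Lovely music and good coffee.",
    "The staff was friendly.",
]
scores = score_aspects(reviews)
for aspect, score in sorted(scores.items()):
    print(f"{aspect}: {score:.2f}")
```

A score of 1.0 means every sentiment word found near that aspect was positive, and 0.0 means every one was negative. Attributing sentiment at sentence level is a deliberate simplification here; a sentence that mentions two aspects gives both the same sentiment.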
5. Dynamic Domain-Specific Entity and Span Extraction for Curated Analytics
There is a lot of information in the world and not a lot of it pertains to us. We are only interested in specific things. With the huge amount of data around us, it is time-consuming to find just what we are looking for.
This solution takes care of that. It can read any amount of text in real time and return only the spans of text a user is interested in. The product can be fine-tuned for any domain.
6. Domain and Organization Specific Cognitive Analytics for Business Documents
This is a very client-specific solution. Its capabilities include, but are not limited to, the following cognitive analytics on business documents:
1. Document, paragraph, and sentence level context distribution
2. Client-specific Named Entity extraction
3. Business Concepts extraction, and their semantic, probabilistic, and syntactic similarities to other Business Concepts
4. Requirements Extraction from running text
5. Impedance Extraction from running text
The cognitive capabilities change from organization to organization and from domain to domain.
7. Automated Hyperparameter Tuning on Machine Learning Algorithms
Sometimes Cross Validation and Grid Search CV are not enough to tune hyperparameters and get the best results out of a model. This is especially true for models that work with very domain-specific data, and it leads to a lot of experimentation before a model is deployed.
However, in pipelines or in production, our hands are often tied when it comes to the number of times we can manually intervene and rerun the model with different hyperparameters. So, we created model-specific automated hyperparameter tuning, which runs before the initialization of a model and thus increases efficiency.
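As a generic illustration of tuning that runs without manual intervention, here is a minimal random search sketch. Random search is my stand-in technique for this example and not the method we built, which I cannot describe. The function `random_search`, the search space and the toy objective are all assumptions; in a real pipeline the objective would train the model and return a validation score.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample n_trials configurations from `space` and keep the best one.

    `space` maps a parameter name either to a (low, high) tuple for a
    continuous range or to a list of discrete choices.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {}
        for name, domain in space.items():
            if isinstance(domain, tuple):
                low, high = domain
                params[name] = rng.uniform(low, high)
            else:
                params[name] = rng.choice(domain)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train the model and return a validation score".
# It peaks at max_depth=6 and learning_rate=0.1.
def toy_objective(params):
    depth_penalty = abs(params["max_depth"] - 6)
    rate_penalty = (params["learning_rate"] - 0.1) ** 2
    return 1.0 - 0.05 * depth_penalty - 10 * rate_penalty


space = {"learning_rate": (0.01, 0.3), "max_depth": [2, 4, 6, 8]}
best_params, best_score = random_search(toy_objective, space, n_trials=200)
print(best_params, round(best_score, 3))
```

The returned `best_params` can then be passed to the model constructor, so the tuning step sits in front of model initialization. The fixed seed keeps runs reproducible.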
8. Automatic Phrase Detection from Unstructured Text
This might seem trivial, but one of the biggest problems in NLP is understanding when a word token is a unigram in itself or part of a bigram, trigram, ..., n-gram. It is a nontrivial problem to solve.
However, we have come up with a few different ways to solve it using statistical, syntactic, semantic and machine learning measures.
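To show what a statistical measure can look like here, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), a standard textbook measure, and keeps the pairs above a threshold. This is a generic example and not one of our variations. The function name `find_phrases`, the tiny corpus and the threshold values are assumptions chosen for the illustration.

```python
import math
import re
from collections import Counter


def find_phrases(sentences, min_count=2, threshold=3.0):
    """Return bigrams whose pointwise mutual information exceeds `threshold`."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    phrases = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # PMI compares how often the pair occurs with how often it would
        # occur if the two words were independent.
        pmi = math.log2(p_pair / (p_first * p_second))
        if pmi > threshold:
            phrases[(first, second)] = pmi
    return phrases


corpus = [
    "machine learning helps the data team",
    "the team uses machine learning on the data",
    "the data lake stores the data for machine learning",
]
phrases = find_phrases(corpus, min_count=2, threshold=3.0)
print(phrases)
```

On this corpus only ("machine", "learning") passes. The pair ("the", "data") is frequent too, but both of its words are common on their own, so its PMI stays below the threshold. On real text the threshold and minimum count need tuning per corpus.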
If you are interested in any of these as a solution, please mail me at [email protected]
One of the major things that I have seen Machine Learning engineers do is try to solve everything using machine learning. I feel that is overkill. For me, the majority of my personal projects arose because I was irritated with something or the other and wanted to solve it. Here are a few of them.
1. Deduplication of Reddit Posts
If you are like me and scan Reddit a lot, you will have seen that one of the major problems with the platform is that Redditors karma-whore. Karma whoring is nothing but reposting an already popular post and earning karma on someone else’s content and effort. The subreddit which triggers me most frequently is r/jokes, where unoriginal content has increased drastically.
So I wrote something that extracts all the posts from a specific Reddit page and checks whether the similarity score of any two posts is too high. If so, they are tagged as duplicates of one another. You can check out the code in the repository linked below. It is really unpolished, but if you want to request any features, let me know at [email protected]
Reddit Experiments GitHub
However, I still browse Reddit using the default app without using my script and curse the mods when I see a repeat post.
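The similarity check described above can be sketched with the standard library alone. This is a simplified illustration and not the code from the repository. The post IDs, the texts and the 0.9 threshold are made up, `SequenceMatcher` from `difflib` stands in for whatever similarity score the script uses, and fetching posts from Reddit is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide a repost."""
    return " ".join(text.lower().split())


def find_duplicates(posts, threshold=0.9):
    """Return (id_a, id_b, score) for every pair with similarity >= threshold."""
    duplicates = []
    for (id_a, text_a), (id_b, text_b) in combinations(posts.items(), 2):
        score = SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
        if score >= threshold:
            duplicates.append((id_a, id_b, score))
    return duplicates


# Hypothetical post IDs and texts, for illustration only.
posts = {
    "p1": "Why did the chicken cross the road? To get to the other side.",
    "p2": "why did the chicken cross the road?  To get to the other side!",
    "p3": "I told my computer a joke, but it did not get it.",
}
duplicates = find_duplicates(posts)
for id_a, id_b, score in duplicates:
    print(f"{id_a} and {id_b} look like duplicates (similarity {score:.2f})")
```

Comparing every pair is quadratic in the number of posts, which is fine for a single page but would need something smarter for a whole subreddit.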
2. Conversion of Books to AudioBooks
When I was younger, I read a lot of books. So much so, that if anyone wanted to gift me something, I would always ask for a book instead of clothes, toys or anything else.
As I grew older, the reading habit slowly faded. Nowadays I want to read, but often I cannot concentrate on just reading and start doing something else after a while. However, I found that audiobooks work really well for me, because I can multitask.
Audiobook platforms were working fine for me until a friend suggested a book that I could not find anywhere in audiobook format. So, I wrote a script that can process a PDF or a text file and convert it into an audiobook.
You can check it out in the GitHub repository below. If you want any more features, or want the code to support more formats, please mail me at [email protected] Machines still can’t modulate their voices like a voice actor can, but it gets the job done. Google is coming pretty close, though, and will get there soon.
Books to AudioBooks GitHub
3. YouTube Downloader
This was from the time I was tired of ads on YouTube and also wanted to add songs from my YouTube playlist to my iPod. All of the online sites which claimed they could download from YouTube were cancerous. So, I just decided to build my own, which can download any YouTube video, audio or playlist.
You can check it out here. If you want additional features, send a mail to [email protected]
YouTube Downloader GitHub
4. Suggesting a Better Auction Algorithm for IPL
We were not satisfied with the way the current IPL auction operates; we thought it was inefficient and redundant. We pitted the current Indian Premier League auction format against a novel Demand Quantified Auction which we thought would perform better, and indeed it did by a huge margin, ranging from 35 percent on some aspects to over 175 percent on others.
If you are interested, please go through the final research report