BLOGS
I attended the ACM Conference on Recommender Systems (RECSYS 2018) in Vancouver last week. People who know me personally would probably be a bit surprised, since my claim to being interested in recommenders is based almost solely on having read Satnam Alag's Collective Intelligence in Action and attending the Coursera course on Recommender Syste...
Sat, 13 Oct 2018 19:13:00 GMT

Last week, I was at our London office attending the RELX Search Summit. The RELX Group is the parent company that includes my employer (Elsevier) as well as LexisNexis, LexisNexis Risk Solutions and Reed Exhibitions, among others. The event was organized by our Search Guild, an unofficial special interest group of search professionals from all th...
Tue, 02 Oct 2018 00:56:00 GMT

One of my (too many) hobbies is to read and learn about leadership, strategy and effective communication. I developed the habit of reading leadership books after I learned about servant leadership — a methodology that teaches one to lead without power over others and without manipulation. Micromanagement in leadership is bad. But this post is not a...
Mon, 01 Oct 2018 12:20:07 GMT

I have spent the last 12 years working with and around cancer data: visualizing it, integrating it, analyzing it. Until recently, I hadn’t questioned the way that biological processes in cancer are explained in textbooks, in scientific papers or in online videos. I’ve relied on a habit developed during high school (inspired by a brilliant biology ...
Tue, 04 Sep 2018 18:01:01 GMT

Earlier this week, I attended a webinar titled Building the Next Generation Recommendation Engine with a Graph Database hosted by TigerGraph as part of their Graph Gurus series. I attended because I am doing some work with recommenders nowadays, and in a past life, I used to do a lot with graphs (not recommenders), and I was curious how they were...
Sat, 01 Sep 2018 06:43:00 GMT

As a Marine Biologist who has taken a sharp turn after college and ended up graduating with a PhD in Bioinformatics, it’s funny how I keep going back to, learning about and thinking about molecular biology. At one point my dream was to become the “Carl Sagan of Biology” — but that, clearly, hasn’t happened :-) Maybe I felt a need to understand data...
Thu, 30 Aug 2018 12:35:45 GMT

I have been experimenting with keyword extraction techniques against the NIPS Papers dataset, consisting of titles, abstracts and full text of all papers from the Neural Information Processing Systems (NIPS) conference from 1987-2017, and contributed by Ben Hamner. The collection has 7239 papers written by 9785 authors. The reason I preferred thi...
Sat, 11 Aug 2018 18:13:00 GMT

Last week, I attended 2018 NLP@UCSF, a half-day event at the University of California, San Francisco (UCSF) organized by the UCSF Clinical Data Community Organizing Team. The star of the show was their corpus of 58 million de-identified clinical notes from their own hospital system. Most of the talks were around work that was done with this dataset...
Tue, 22 May 2018 23:45:00 GMT

I recently completed (auditing) the Matrix Factorization and Advanced Techniques course on Coursera, conducted by the same people from the University of Minnesota who gave us the Lenskit project and an earlier awesome course on Recommender Systems on Coursera. Since I am auditing the course, Coursera no longer allows me to submit answers to quizz...
Sat, 05 May 2018 21:27:00 GMT

Earlier this week (April 10 and 11), I was at the Haystack Search Relevance conference in Charlottesville, VA. The conference was organized by Doug Turnbull and Eric Pugh from OpenSource Connections (o19s). Doug Turnbull (and his co-author John Berryman, who was also at the conference) introduced me many years ago to the world of principled searc...
Sun, 15 Apr 2018 21:24:00 GMT

I attended the AWS ML Week in San Francisco a couple of weeks ago. It was held over 2 days and consisted of presentations and workshops, presented and run by Amazon Web Services (AWS) architects. The event was meant to showcase the ML capabilities of AWS and was targeted at Data Scientists and Engineers, as well as innovators who want to include Ma...
Sat, 07 Apr 2018 21:54:00 GMT

NLTK users know that much of the library's functionality, even seemingly basic operations like sentence and word tokenization, depends on machine learning models pre-trained on default corpora. These models are available as a separate download because of their size. Making these models available to your code is simple -- just a single one-time nltk.download()...
Sat, 17 Mar 2018 19:23:00 GMT

While clustering some data on Spark recently, I needed a quantitative metric to evaluate the quality of the clustering. Couldn't find anything built-in, so (predictably) went looking on Google, where I found this Stack Overflow page discussing this very thing. However, as you can see from the accepted answer, the Silhouette score by definition is...
Sat, 03 Mar 2018 19:51:00 GMT
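The entry above refers to the Silhouette score, which Spark lacked out of the box at the time. As a reminder of the definition being discussed, here is a minimal single-machine Python sketch of the textbook formula (the function name and layout are mine, not code from the post):

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def silhouette(points, labels):
    """Mean Silhouette score over all points.

    For each point i: a(i) is the mean distance to the other members of
    its own cluster, b(i) is the smallest mean distance to the members
    of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    """
    # Group point indices by cluster label.
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: s(i) is defined as 0
            continue
        # a(i): mean intra-cluster distance.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to any other cluster.
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))  # s(i) lies in [-1, 1]
    return sum(scores) / len(scores)
```

Note that every a(i) and b(i) requires pairwise distances, so the definition is O(n²) over all points, which is exactly what makes a naive distributed implementation on Spark expensive.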

We often hear that data is the new oil. Like oil, there is a process of refining data so it can become useful. Classification is by far still the most widely used (and useful) end-product of data. But classification involves training models, which often involves manual labeling work by domain experts to generate the training data. Given that, it ...
Sat, 17 Feb 2018 23:23:00 GMT

Last week, I wrote about using the Snorkel Generative model to convert noisy labels to an array of marginal probabilities for the label being in each class. This week, I will describe the second part of the experiment, where I use these probabilistic labels to train a Discriminative model such as a Classifier. As a reminder, the standard pipeline...
Sun, 04 Feb 2018 06:22:00 GMT

According to its creators at the HazyResearch group at Stanford, Snorkel is a system for rapidly creating, modeling and managing training data. I first heard of it when attending Prof Christopher Ré's talk on his DeepDive project at the Data Science Summit in San Francisco almost 2 years ago. The DeepDive project has since morphed into a lighter-...
Sat, 27 Jan 2018 19:27:00 GMT

Happy New Year! My New Year's resolution for 2018 is, perhaps unsurprisingly, to blog more frequently than I have in 2017. Despite the recent advances in unsupervised and reinforcement learning, supervised learning remains the most time-tested and reliable method to build Machine Learning (ML) models today, as long as you have enough training ...
Sat, 13 Jan 2018 20:12:00 GMT

In terms of toolkits, my Deep Learning (DL) journey started with using Caffe pre-trained models for transfer learning. This was followed by a brief dalliance with Tensorflow (TF), first as a vehicle for doing the exercises on the Udacity Deep Learning course, then retraining some existing TF models on our own data. Then I came across Keras, and l...
Thu, 16 Nov 2017 06:58:00 GMT

Last week a colleague and I were trying to figure out why his network would crash with a NaN (Not a Number) error some 20 or so epochs into training. Lately I have also become more interested in tuning neural networks, so this was a good opportunity for me to suggest fixes based on reasoning about the network. The network itself was built with Ke...
Sun, 29 Oct 2017 00:45:00 GMT

One of the reasons I have been optimistic about the addition of Keras as an API to Tensorflow is the possibility of using Tensorflow Serving (TF Serving), described by its creators as a flexible, high performance serving system for machine learning models, designed for production environments. There are also some instances of TF Serving being use...
Sat, 30 Sep 2017 22:37:00 GMT

Last week, I was at EMNLP 2017 in Copenhagen. EMNLP is short for Empirical Methods in Natural Language Processing, and is one of the conferences of the Association for Computational Linguistics (ACL) that brings together NLP professionals from academia and industry to talk about their research and present their findings to each other. The co...
Thu, 14 Sep 2017 07:40:00 GMT

If you have been following my last two posts, you will know that I've been trying (unsuccessfully so far) to prove to myself that the addition of an attention layer does indeed make a network better at predicting similarity between a pair of inputs. I have had good results with various self attention mechanisms for a document classification system...
Mon, 21 Aug 2017 21:24:00 GMT

In my last post, I described an experiment where the addition of a self attention layer helped a network do better at the task of document classification. However, attention didn't seem to help for another experiment where I was trying to predict sentence similarity. I figured it might be useful to visualize the outputs of the network at each sta...
Sun, 13 Aug 2017 01:53:00 GMT

A couple of weeks ago, I presented Embed, Encode, Attend, Predict - applying the 4 step NLP recipe for text classification and similarity at PyData Seattle 2017. The talk itself was inspired by the Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models blog post by Matthew Honnibal, creator of the spaCy Natu...
Sat, 22 Jul 2017 23:49:00 GMT

Last week I attended (and presented at) PyData Seattle 2017. Over time, Python has morphed from a scripting language, to a language for scientific computing, and lately to pretty much the standard language for most aspects of Machine Learning (ML) and Artificial Intelligence (AI), including Deep Learning (DL). PyData conferences cater mostly to the las...
Thu, 13 Jul 2017 02:01:00 GMT

Yesterday I attended the Graph Day SF 2017 conference. Lately, my interest in graphs has been around Knowledge Graphs. Last year, I worked on a project that used an existing knowledge graph and entity-pair versus relations co-occurrences across a large body of text to predict new relations from the text. Although we modeled the co-occurrence as ...
Sun, 18 Jun 2017 22:00:00 GMT

Recently, a colleague and a reader of this blog independently sent me a link to the Simple but Tough-to-Beat Baseline for Sentence Embeddings (PDF) paper by Sanjeev Arora, Yingyu Liang, and Tengyu Ma. My reader also mentioned that the paper was selected for a mini-review in Lecture 2 of the Natural Language Processing and Deep Learning (CS 224N...
Sun, 21 May 2017 02:04:00 GMT

The Deep Learning toolkit I am most familiar with is Keras, having used it to build some models around text classification, question answering and image similarity/classification in the past, as well as the examples for our book Deep Learning with Keras that I co-authored with Antonio Gulli. Before that, I have worked with Caffe to evaluate its p...
Sat, 13 May 2017 23:26:00 GMT

by Corey A Harper. Code4lib 2017 was hosted by UCLA on March 6-9, 2017. This was the 12th code4lib conference, and was attended by over 450 library technologists. Amazingly, and despite the increased size, the main conference has managed to stay a single track meeting. This contributes to its appeal, as the shared experience fosters a sense of camara...
Tue, 11 Apr 2017 21:41:20 GMT

by Mike Lauruhn. Co-located with the Joint Conference on Digital Libraries (JCDL), the 5th International Workshop On Mining Scientific Publications took place on June 22-23 at the Newark, New Jersey campus of Rutgers University. An engaged crowd of about 25 listened to paper presentations and participated in conversation and networking. Twelve paper...
Tue, 02 Aug 2016 21:33:12 GMT

by Mike Lauruhn. The main program of the 2016 ESWC conference took place on May 31 - June 2, 2016. [For my write-up on the Workshops & Tutorials, please see my previous blog post.] For some reason, it seemed fitting to me personally that the main program for ESWC 2016 was book-ended by a pair of talks about owl:sameAs. Jim Hendler gave the opening key...
Thu, 30 Jun 2016 22:03:50 GMT

This third and final part of a trip report about NAACL 2016 covers Thursday’s workshop on human-computer question answering. It featured several good talks and posters. Plus, at the end of the workshop the winning system for the quizbowl shared task faced off with the California championship team. This was possible since, conveniently, all the memb...
Wed, 22 Jun 2016 18:31:06 GMT

Earlier I blogged about the NAACL conference. The conference proper ran Monday through Wednesday. It was followed by workshops on Thursday and Friday. I’ll describe Thursday’s workshop on Question Answering in a future posting. This one is about Friday’s workshop on Automated Knowledge Base Completion (AKBC). It was the highlight of the week for me, and I...
Mon, 20 Jun 2016 21:46:52 GMT

The weather was beautiful in San Diego last week during the North American chapter of the Association for Computational Linguistics conference, better known as NAACL.  Lots of interesting stuff on the inside of the meeting hotel as well. The conference and affiliated workshops took all week, so there is too much material for me to describe in a rea...
Mon, 20 Jun 2016 21:07:37 GMT

by Mike Lauruhn. ESWC 2016 took place from May 29th, 2016 to June 2nd, 2016 in Crete, Greece. The program had lots to offer in a variety of formats including Workshops, Tutorials, Papers across several tracks and specializations, posters and demos, and keynote speakers. The first two days of the conference offered more than 20 workshops and tutorials. ...
Thu, 09 Jun 2016 19:05:29 GMT

As it turns out, replicating this work is not too hard. You just need to copy and modify TensorFlow's translation example. Here are the tips... Last December at the NIPS conference I got the opportunity to talk with Lukasz Kaiser of Google about the work described in “Grammar as a Foreign Language”. If you are not familiar with the paper, check it o...
Tue, 22 Mar 2016 14:13:24 GMT

Google announced TensorFlow Serving today. The basic notion is simple – take your trained TensorFlow model and make it into a web service running on some scalable hardware. Predictions on demand. The fact that this is part of the TensorFlow way of doing things is not the important bit. (In fact, truth be told, I’m finding TensorFlow to be […]
Wed, 17 Feb 2016 00:10:54 GMT

It seems like I only make time to write on this blog when I’m at a conference. And the reason for that is that I use these as my trip reports. 🙂 I’m at the NIPS conference in Montreal. Things are crazy here with 3700 attendees. The thing about NIPS that is really unusual is […]
Wed, 09 Dec 2015 20:23:49 GMT

Yesterday, Databricks announced that they are making Spark debugging easier by their integration of the Spark UI into the Databricks platform. True enough, but don’t confuse “easier” with “easy”. Don’t get me wrong – we have Databricks at work and I love it. But debugging has been its weak point. The Spark UI integration gives […]
Thu, 24 Sep 2015 21:39:48 GMT

by Mike Lauruhn. Following a fun day at the Jane-athon and a team dinner with friends and colleagues, I was ready for the weekend sessions at ALA Annual. Managing an ALA weekend schedule has always meant making decisions about what to attend and acknowledging that one simply cannot attend every session they would like to. For me, this year's case in ...
Fri, 10 Jul 2015 18:08:16 GMT

by Mike Lauruhn "As a matter of fact, I am registered for the SF #janeathon", the tweet proclaimed. Yes, it was an all-day Jane Austen-related workshop taking place at the American Library Association Annual conference. And no, it was not an endurance competition reading Jane Austen novels or watching their film adaptations. In the official event d...
Wed, 08 Jul 2015 16:21:33 GMT

by Mike Lauruhn. UCLA's Royce Hall was the setting of the biennial North American Symposium on Knowledge Organization (#NASKO2015) on June 18-19, with the Department of Information Studies serving as host. Over the two days of the symposium, the variety and range of papers was impressive and represented the many different ways that the field of Knowl...
Thu, 25 Jun 2015 19:20:06 GMT

I was at the Spark Summit last week. Lots of interesting talks and some good chats with people in the booths in the exhibit hall. Different people will have different take-aways from the conference; I’d like to call out a couple of big themes and some miscellaneous topics. The big themes are Spark’s growth and […]
Sun, 21 Jun 2015 22:35:20 GMT

by Curt Kohler & Darin McBeath. At this month's Cincinnati Spark Meetup, Doug Needham from Illumination Works presented some background on graph theory and then we dug into some code examples using the Spark GraphX library. There were a lot of great examples, discussion and interest in graphs. After the presentation we reviewed some of the highli...
Wed, 17 Jun 2015 18:07:46 GMT

Cincinnati Spark Meetup, April 15, 2015, by Curt Kohler & Darin McBeath. The Cincinnati Spark Meetup continues to attract new members at a brisk pace. After only 3 months we have grown to include over 70 members who are interested in learning about this compell...
Tue, 28 Apr 2015 16:49:07 GMT

Last week I was at the AAAI Symposium on Knowledge Representation and Reasoning (KRR). Check out the schedule at that link, and the accepted papers presented as posters. Lots of good stuff there if, of course, you are into that kind of thing. I think getting to hear Geoff Hinton and Doug Lenat recapitulate the battle […]
Thu, 02 Apr 2015 05:02:20 GMT