I previously posted that the Coursehero.com “study help” website was selling old coursework online, and seemed to be web-scraping any academic-related documents they could find. I’ve never had an account with the site, but several of my documents and publications had made their way in, including pages from my website which were never academic-related.
I contacted Coursehero, and after I’d jumped through a formal request hoop, they removed the files I’d identified as mine. They claim that all their content is submitted by members.
I’ve now noticed another oddity with their site. My contact information, which Coursehero appears to have scraped off my resume, has somehow become associated with many random files on the site, including content from courses I’ve never taken, universities I’ve never been to (or heard of), and other random material. It looks like they have an overzealous web-scraping algorithm gobbling up anything it can find online and attempting to add value by attaching context keywords and metadata. Kind of like those websites that try to sell background checks by listing every possible phone number and name in the hopes that it will increase search engine hits. Data gone rogue. I pity the student who tries to find something on Coursehero with a keyword search: they’ll be sifting through false positives for hours, when they could just as easily be studying something useful or finding the original (probably free) source that Coursehero grabbed the file from.
I suppose this should be a lesson to me… anything posted online will eventually be scanned by bots and randomly associated with random content on random websites in an attempt to make money (and to think, in high school and college they said we should all have our resumes and work examples out there for the world to see, so we could attract potential employers!).
On the flip side, eventually this blog post will be grabbed and/or linked to by review websites to be found by anyone who cares to check up on Coursehero.com. (You should have noticed by now that the content they are selling is disorganized and mis-categorized at best… and quite possibly available for free elsewhere online…)
I emailed Coursehero to see if they could clean up their database.
gabe to email@example.com
I came across a few more documents tagged with my contact information.
I don’t recognize most of these from the previews, but as I don’t have
an account, I can’t see the full files. If they’re not mine, I don’t
know why they’re tagged with my name, email, street address, and phone
number. It looks like your scraping algorithm somehow associated my
contact details with random other things in your database.
I searched for my phone number and got 73 results, most appear to be
tagged with text from my resume.
I searched for my website and got 81 results:
I got 64 results for my name. Some of these are tagged with title and
author info from a paper I co-authored, but are widely different
subjects and titles, everything from music to mechanical engineering.
Again, from the previews, I don’t recognize any of these as my
material. It’s possible that one or two might be legitimate uploads by
co-authors of papers I worked on, but many of these files are not even
from the same university that I attended. Overall it seems like your
scraper just tags random files with random data, I can’t see how that
benefits you or your customers, but I’d appreciate it if my contact
info were removed from your database.
Update: Coursehero replied, but didn’t actually address my concern. They claim that my name won’t show up on their site once Google’s search updates, but what about my phone number, address, and website matching random documents on their site via their internal search? Perhaps they mean that they’re using Google Site Search or something similar as the backend for their website’s search bar, and that their keyword-stuffing algorithm will stop associating random files with the contents of my resume now that they’ve deleted it? I emailed back to try to clarify this.
firstname.lastname@example.org to gabe
Thank you for contacting us.
Please be informed that we have removed the document pertaining your personal information from: <link>. Google periodically crawls Course Hero to check for changes and update their search results accordingly. Once the process is complete, that page should not come up in the Google search results for your name. Please be aware it may take up to a few weeks for Google to re-crawl our website and update the search results.
If you have any questions, please let us know. Thank you.
gabe to email@example.com
I wasn’t talking about Google’s search results, but the search results
on your own site (or do you mean that you’re using Google as the
search backend on your site?)