More Coursehero Database Shenanigans

January 15, 2013

I previously posted that the Coursehero.com “study help” website was selling old coursework online, and seemed to be web-scraping any academic-related documents they could find. I’ve never had an account with the site, but several of my documents and publications had made their way in, including pages from my website which were never academic-related.

I contacted Coursehero, and after I’d jumped through a formal request hoop, they removed the files I’d identified as mine. They claim that all their content is submitted by members.

I’ve now noticed another oddity with their site. My contact information, which Coursehero appears to have scraped off my resume, has somehow become associated with many random files on the site, including content from courses I’ve never taken, universities I’ve never been to (or heard of), and other random material. It looks like they have an overzealous web-scraping algorithm gobbling up anything it can find online, and attempting to add value by tacking on context keywords and metadata. Kind of like those websites that try to sell background checks by listing every possible phone number and name in the hopes that it will increase search engine hits. Data gone rogue. I pity the student who tries to find something on Coursehero with a keyword search. They’ll be sifting through false positives for hours, when they could just as easily be studying something useful or just finding the original (probably free) source that Coursehero grabbed the file from.

Funny, I don’t recall taking an 80,000-level course in Missouri…

I suppose this should be a lesson to me… anything posted online will eventually be scanned by bots and randomly associated with random content on random websites in an attempt to make money (and to think, in high school and college they said we should all have our resumes and work examples out there for the world to see, so we could attract potential employers!).

On the flip side, eventually this blog post will be grabbed and/or linked to by review websites to be found by anyone who cares to check up on Coursehero.com. (You should have noticed by now that the content they are selling is disorganized and mis-categorized at best… and quite possibly available for free elsewhere online…)

I emailed Coursehero to see if they could clean up their database.

From gabe to support@coursehero.com

I came across a few more documents tagged with my contact information.
I don’t recognize most of these from the previews, but as I don’t have
an account, I can’t see the full files. If they’re not mine, I don’t
know why they’re tagged with my name, email, street address, and phone
number. It looks like your scraping algorithm somehow associated my
contact details with random other things in your database.

I searched for my phone number and got 73 results, most appear to be
tagged with text from my resume.
<link>

I searched for my website and got 81 results:
<link>

I got 64 results for my name. Some of these are tagged with title and
author info from a paper I co-authored, but are widely different
subjects and titles, everything from music to mechanical engineering.
<link>

Again, from the previews, I don’t recognize any of these as my
material. It’s possible that one or two might be legitimate uploads by
co-authors of papers I worked on, but many of these files are not even
from the same university that I attended. Overall it seems like your
scraper just tags random files with random data, I can’t see how that
benefits you or your customers, but I’d appreciate it if my contact
info were removed from your database.

Thanks,
-Gabe Emerson

Update: Coursehero replied, but didn’t actually address my concern. They claim that my name won’t show up on their site once Google’s search updates, but what about my phone number/address/website matching random documents on their site via their internal search? Perhaps they mean that they’re using Google Site Search or something similar as their website search bar backend, and their keyword-stuffing algorithm will stop associating random files with the contents of my resume after they deleted my resume? I emailed back trying to clarify this.

support@coursehero.com to gabe

Dear Gabe,

Thank you for contacting us.

Please be informed that we have removed the document pertaining your personal information from: <link>. Google periodically crawls Course Hero to check for changes and update their search results accordingly.  Once the process is complete, that page should not come up in the Google search results for your name.  Please be aware it may take up to a few weeks for Google to re-crawl our website and update the search results.
If you have any questions, please let us know.  Thank you.

----------------------------------------

gabe to support@coursehero.com

I wasn’t talking about Google’s search results, but the search results
on your own site (or do you mean that you’re using Google as the
search backend on your site?)


Coursehero scrapes your stuff and sells it

December 20, 2012

I happened to come across a website which seems to have grabbed a bunch of files from my website, and is now offering them for sale to college students. It seems to be a nice combination of “Cheating is wrong, but here’s a bunch of other people’s work”, and “We’ll sell you stuff we found for free online”. Two great business models come together! It looks like it’s all a (poorly programmed) automated system that scrapes the net for academic-related material (for example, searching for published term papers, study notes, etc), slaps it behind a paywall, and then promises students a better grade if they sign up. I’m not sure what the actual cost/value of each paper would be, but cruising through their terms-of-use nets this info:

$95.40/yr
$59.85/quarter (you know, for all those colleges operating on the quarterly schedule…)
$39.95/mo
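For a rough idea of what that works out to per paper, here is a back-of-the-envelope sketch. The prices are the ones listed above, and the 10-downloads-per-day cap is what the same terms of use appear to allow. The period lengths (365, 91, and 30 days) and the function names are my own assumptions, and the per-download figure is only a best case for someone who maxes out the cap every single day.

```python
# Prices are the ones quoted from the terms of use.
# Period lengths in days are assumptions made for this estimate.
PLANS = {
    "year": (95.40, 365),
    "quarter": (59.85, 91),
    "month": (39.95, 30),
}

# Download cap the terms of use appear to allow (assumed to apply to every plan).
DOWNLOADS_PER_DAY = 10


def floor_cost_per_download(price, days, per_day=DOWNLOADS_PER_DAY):
    """Lowest possible cost per file, if the daily cap is used every day."""
    return price / (days * per_day)


def annualized_cost(price, days):
    """What a plan would cost if renewed back-to-back for 365 days."""
    return price * 365 / days


if __name__ == "__main__":
    for name, (price, days) in PLANS.items():
        print(
            f"{name}: at least ${floor_cost_per_download(price, days):.3f} per download, "
            f"about ${annualized_cost(price, days):.2f} over a full year"
        )
```

Anyone who only wants one or two files pays the whole subscription price for them, so the real per-paper cost is likely far higher than this floor.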

These rates apparently get you 10 PDF downloads per day. You can also upload any of your content that their auto-scraper happens to miss, but the terms of use say that they don’t pay you anything for doing this (maybe you get a discount of some kind?). Their auto-scraper appears to have a hard time parsing and sorting the stuff it grabs, as it has apparently decided that I am a class (actually that would be kind of funny… GABE101, students learn how to remodel free boats and build potato guns 😛 )

coursehero_screenshot

It also seems to be pretty opportunistic in what it grabs. For example, documents associated with me include everything from actual class content (Powerpoints and notes that I had online), to random content from my website (resume, text files, etc) which would be useless to most students. Yep, you can pay $95.40 for the privilege of reading a .txt file on dumpster diving that I wrote over a decade ago! This will definitely help you pass GABE101, as the midterm grade is based on what you can find in the trash.

coursehero_screenshot_2 coursehero_screenshot_3

All in all, this site seems about as useless as any of those 500 pages you get when you search a phone number (Join OUR phonebook, the NUMBER ONE place for phone-numbers-stolen-from-everyone-else’s number one phonebook!). Googling around a bit, I found some more info on this Coursehero operation (these are opinions or reviews from other websites):

“Coursehero probably isn’t worth the money. The vast majority of the material has been stolen from faculty/university websites. It may be that all the material for your particular school is freely available on the school website.
As anecdotal evidence of this fact there are forms and reports from the Dean’s office of my school on coursehero. The website is a copyright infringing scam to steal your money. It makes me mad that they are stealing money from poor students.”
(From http://answers.yahoo.com/question/index?qid=20100125160855AAu6HwM)

“They are posting stolen content that’s why when you pay and log in, you don’t see most of it because they are being hit hard with DMCA take down letters. This site won’t be up long. Don’t waste your money.”

(from http://m.rateitall.com/i-2055646-course-hero.aspx)

“Yep.  My college has a department named FILES.  Who knew?  Apparently the FILES department teaches something having to do with field exercises.
Yike. To me, this sounds like it is information scraped from unsecured databases or Web site directories. I’d contact the school and let them know that they might have a security issue.”

“I’ll bet there will be lawsuits — they’re clearly just scooping up whatever’s free and available and charging for it.  My stuff’s not open access, so it’s not there.  Remind me to put my name on my handouts….”
(from http://chronicle.com/forums/index.php?topic=63326.10;wap2)

Their terms of service includes a rather clunky method for claiming copyright infringement, so I decided to take a different approach. Waiting to see if they even respond… If not then I guess I can go all DMCA on it 😛

Subject: Content usage
From: Gabe Emerson <>
To: info@coursehero.com

Hi,

I notice you’re offering a collection of my work for sale on your site. The materials appear to have been scraped from my website without permission, as I have not had an account with your site and did not provide the content. Please provide me with a list of access and sales records for my content, and I will send you an invoice for the use of this material.

Specifically, some of my material seems to be listed under “Course: EMERS089” http://www.coursehero.com/sitemap/schools/1241-Minnesota/courses/774509-EMERS089/ I am not sure if you have additional materials as well.

Sincerely,
-Gabriel Emerson

Update: Coursehero wrote back asking me to verify in a more “formal” manner that the files were mine, and were used without permission. I responded, and they appear to have removed them (at least, the ones that I pointed out; since I don’t have an account, I don’t know if anything else of mine is on there).