notes

Databases for text storage

Backends

PostgresSQL
MongoDB

Issues

Indexing
Query language
Schema

We'll show how to:

Define an (indexed) collection (table)
Store some tweets in it
Run some queries to fetch those tweets

PostgresSQL

SQL (SELECT * FROM tweets)
Postgres (opensource & free, very fast, text indexing)
Quick note on data types

DB setup

$ createdb texts
$ psql texts

CREATE TABLE users (
    id bigint primary key,
    screen_name text,
    description text);

CREATE TABLE tweets (
    id bigint primary key,
    user_id bigint REFERENCES users(id),
    retweet_of bigint,
    timestamp timestamp,
    text text);

run the statements in tweets.sql to insert data.

CREATE INDEX text_idx ON tweets USING gin(to_tsvector('english', text));

Destroy database

$ dropdb texts

SQL commands to show:

SELECT * FROM tweets WHERE user_id=258022817;
SELECT * FROM tweets WHERE lower(text) LIKE '%obama%';
SELECT * FROM tweets WHERE timestamp > '2014-12-2';
SELECT tweets.timestamp, users.screen_name FROM tweets JOIN users ON tweets.user_id=users.id;


SELECT to_tsvector('english', 'hi there my name is jona i am a programmer and this is a database intro') @@ to_tsquery('programming');

SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');
CREATE INDEX text_idx ON tweets USING gin(to_tsvector('english', text));
% Now this will be faster...
SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');


SELECT user_id, count(*) FROM tweets GROUP BY user_id;
SELECT users.screen_name, count(*) FROM tweets JOIN users ON users.id=tweets.user_id GROUP BY users.screen_name, user_id;

rpostgresql

install.packages("RPostgreSQL")

library("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="texts")

rs <- dbSendQuery(con, "select * from tweets")
tweets <- fetch(rs, n=10)

MongoDB

db.tweets.find()
opensource, free, and very fast
no schema!

mongoexport -h smapp-politics:27011 -d OccupyWallStreet -c tweets -u smapp_readOnly -p 'xxxxxxxx' | mongoimport -d OccupyWallStreet -c tweets

db.tweets.ensureIndex({text: "text" })
db.tweets.findOne({$text: {$search: 'obama'}})

db.tweets.aggregate({$group: {_id: "$user.screen_name", number: {$sum: 1}}})

RMongo

install.packages('RMongo')
library(RMongo)
conn <- mongoDbConnect('OccupyWallStreet')
query <- dbGetQuery(conn, 'tweets', '{$text: {$search: "obama"}}')
data1 <- query[c('timestamp', 'text')]

Databases for text storage

Backends

Issues

We'll show how to:

PostgresSQL

DB setup

Destroy database

SQL commands to show:

rpostgresql

MongoDB

RMongo

Links