//-*- coding: utf-8 -*-

// stjohnscollege.edu
// - lost event because we changed the implied sections algo and it no
//   longer adds the address and store hours as a single implied section
// - probably should write this one off

// ingramhillmusic.com
// - stop it from telescoping to the month/year pairs in the blog roll.
//   detect brothers that are month/year pairs in a list and do not telescope
//   to them. set their dates as DF_ARCHIVE_DATES

//. to fix folkmads.org we should allow the 3 tods to propagate up to the
//  date above them. then to avoid mult locations for event we should telescope
//  all pieces of the date telescope kinda like at the same time until we
//  encounter an address. OR allow an hr tag to propagate up until it hits
//  text, then set its section there.

// Dates.cpp revision idea:
// - i might go so far to say that any time you have different dates in
//   a section that you are compatible with, then things are ambiguous
//   and you should give up entirely with the telescope.
// - we use this algo for assigning addresses i think to event dates
// - we should keep the telescope up until it hits a point of ambiguity
// - but if we can contain 2+ dates from the same section in the same
//   telescope then it is not ambiguous and that is ok....
// - how would this affect our other pages?
// - would fix http://www.ingramhillmusic.com/tour/ ?
// - would fix stoart.com ?
// - would fix christchurchcincinnati.org ?

// --------------------
// list of events with bad times (4) (fix these first)
// --------------------

// http://christchurchcincinnati.org/worship
// - bad implied section, should be based on h2 tag, but it is based on
//   a single tag with heading bit set (METHOD_ATTRIBUTE) i think
// - gets some wrong event dates
// - 12:10 should not telescope to "the Sundays" because it has
//   "wednesdays" in its title. do we have bad implied sections?
// - misses "ten o'clock" date format

// http://milfordtheatreguilde.org/Larceny.htm
// - gets some wrong event dates
// - seems to be ignoring the date list: oct 8th, 9th, ...
// - easy fix

// http://www.contemporaryartscenter.org/UnMuseum/ThursdayArtPlay
// - gets some wrong event dates
// - allows a store hours to telescope to all possible combos but in this
//   case it should always telescope to the "summer" in its sentence...
//   and be required to have that.
// - just allow the plain store hours to be a subdate if compared to a
//   store hours that has a seasonal or month range...
// - easy fix

// http://www.stoart.com/
// - what happened to datelistcontainer around the dow list?
// - eliminate addresses that are picture subtitles or are in
//   picture galleries. the address is describing the picture not the
//   event.
// - the author's schedule is not aligned with the dows. he relies on the
//   browser rendering the two table columns just right...
// - should not be allowing those lists of tod ranges to telescope to
//   any dow since the dows are in their own list. i thought i had logic
//   added recently to prevent this...
// - then if such a thing happens, that list of headers should block the
//   telescoping and we end up with just a bunch of tod ranges, and we
//   should ignore any event that is just a tod range.
// - likewise, July 2010 should not telescope to Saturday then 1:30-3:30pm.
//   it can telescope to Saturday because we allow telescoping to a list
//   of headers for the new MULTIPLE HEADER algo, but then the Saturday
//   can't telescope to a non-contained or brother list of tods
// - do not consider vertical lists the same date types, and do not allow
//   any other dates to telescope to them or past such vertical lists. also
//   the vertical list must be side-by-side with another vertical list for
//   this algo to really work. so quite a few constraints for something
//   that is ambiguous anyhow, even if in a side-by-side list format.
// - and address for events is wrong.
//   If the wrong address was in a sentence, like "I created this work of
//   art at 1528 Madison Rd. Cinti OH" then we could at least look at the
//   structure of the sentence to deduce that it was not talking about the
//   events. But it has no sentence context.
// - maybe if the address is in a list of other "things", don't use it...
// - if the address is in a list of brothers, and the tod of the event is
//   not a brother in that list, i would say, ignore the address. the idea
//   being that the list is independent of the tod. i think this could hurt
//   some good pages though...
// - maybe we can fix by noting that the gallery address is unused and
//   set EV_UNCLEAR_ADDRESS on the events we do find. or EV_NESTED_ADDRESSES,
//   since one address is like the header of the other...
// * HOW TO FIX???
// * HARD FIXES - maybe just leave alone

// --------------------
// end list of events with bad times
// --------------------

// --------------------
// list of events with bad locations (1) (fix these next)
// --------------------

// http://www.so-nkysdf.com/Wednesday.htm
// - i think our METHOD_DOW_PURE fixed these implied sections
// - but why aren't we getting the "Hex" title?
// - ah, our implied sections are the best, they are shifted down by 2!
// - why didn't a A=25 tagid work out? yeah if we did avgsim it would work..
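
// A minimal sketch of the brother-list address rule proposed for
// stoart.com above: ignore an address that sits in a list of brothers
// when the event's tod is not a brother in that same list. The Item
// struct, its listId field and ignoreAddressForTod() are hypothetical
// names for illustration only; they are not taken from Events.cpp or
// Dates.cpp.

```cpp
#include <cassert>

// Hypothetical stand-in for a parsed page item (an address or a tod).
// listId identifies the brother list the item sits in, or -1 if none.
struct Item {
    int listId;
};

// Returns true if the address should be ignored for the event whose tod
// is given: the address is in a brother list and the tod is not a member
// of that same list.
bool ignoreAddressForTod(const Item &address, const Item &tod) {
    if (address.listId < 0) return false; // address is not in any list
    return tod.listId != address.listId;
}

// Small sanity checks for the rule.
void checkIgnoreAddressForTod() {
    assert(ignoreAddressForTod(Item{3}, Item{-1}));  // tod outside the list
    assert(!ignoreAddressForTod(Item{3}, Item{3}));  // tod is a brother
    assert(!ignoreAddressForTod(Item{-1}, Item{2})); // address not in a list
}
```

// Since the note above warns this could hurt some good pages, the result
// may be better used to set a flag such as EV_UNCLEAR_ADDRESS than to
// drop the address outright.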

// http://all-angels.com/programs/justice/
// - each event has a school name in the tod sentence, but we are not
//   recognizing that as a place!!
// - need to identify default city/state of a website for getting the schools
// * BAD EVENTS
// * NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE
// * SUPPORT "at the following/these locations/places:"

// --------------------
// end list of events with bad locations
// --------------------

// st-margarets.org/
// - missing the thanksgiving eve as the title
// - telescoping to a fuzzy year range 2009-2010, should make that fuzzy

// http://www.ingramhillmusic.com/tour/
// - identify lists of disjoint dates. do not allow those lists to participate
//   in the telescoping process. then unless the date you are telescoping
//   from is in that list, you must ignore the dates in that list as far
//   as telescoping to them as headers. and the dates in that list can't
//   be the base of a telescope either.
// - this might be another way to fix thewoodencow.com
// - what about stoart.com, it would prevent the one list of tods from
//   combining with the other list of dows. so we would lose most of our
//   events for stoart.com
// - this basically would short-circuit our combinatorics approach???
//   i.e. "comboTable" in Dates.cpp?
// - i might go so far to say that any time you have different dates in
//   a section that you are compatible with, then things are ambiguous
//   and you should give up entirely with the telescope.
// - how would this affect our other pages?
// - or just keep it simple and label the dates as DF_ARCHIVE_DATE since
//   their month/year list format is very popular. then just ignore such
//   dates for telescoping.

// http://www.guysndollsllc.com/page5/page4/page4.html
// - more or less ok. most events are outlinked titles.

// http://www.lilcharlies.com/brewCalendar.asp
// - Sunday should not map to 4pm-6pm but it does because we think 4pm-6pm
//   is store hours, but how can we think that?
//   it needs to combine with
//   a dow in order to be store hours.
// - how did we get "Sunday [[]] 4pm - 6pm" ???
// - brbrtagdelim (double br) should be enough to keep the right dow mapping
//   to the right tod.
// - bad titles because we think the strong tag portion is part of a longer
//   sentence. so do not make sentence go across the strong or bold tag
//   or italic or underline tag UNLESS the next word is lower case, etc.
//   so treat these non-breaking tags as we treat the other breaking tags.
// - BETTER SENTENCE DETECTION (EASY)

// http://sfmusictech.com/
// - hotel kabuki
// - we now get the cocktail event again since i added custom-delimiter
//   implied sections

// http://www.guysndollsllc.com/
// - has bad telescope: "until 2:00 a.m [[]] Tuesday through Sunday (Monday)"
//   which does not have a real start time. should telescope to
//   "4:00 p.m. until 2:00 a.m." since it should be kitchen hours.
// * INCOMPLETE EVENT TIME
// * FIX KITCHEN HOURS
// * FIX ONGOING EVENT DATE TELESCOPES

// http://www.thepokeratlas.com/poker-room/isleta-casino/247/
// - these all seem to be in november 2009 and spidered in may 2010 so the
//   dates are old
// - implied sections need help here really
// - 2009 is not being detected as a copyright date which it should be
//   because it is in a tag at the bottom of the page.
// - BETTER COPYRIGHT DETECTION. telescope around the year's sentence until
//   we hit other text. search for "copyright" in all tags telescoped to.

// http://www.southgatehouse.com/
// - misses title "Yo La Tengo" because it thinks it is in a menu and
//   gets "Non Smoking Show" as at least the same title score...
// - how to fix?
// * BAD EVENT TITLES

// http://www.cabq.gov/library/branches.html
// - we fixed the titles with our new implied sections
// - title #12 is in same implied section as #11. why? because missing
// - #1 has a bad event title. why is it getting that google map as title?

// http://www.burlingtonantiqueshow.com/7128.html
// - if city state follows ()'s which follow street, treat it as inlined still
//   that way we can get the right address here
// - use the alt=directions link as the site venue. should update the venue
//   algo to look at that. also consider "location" or "how to get here/there"
// * DISREGARD ()'s FOR INLINED ADDRESSES
// * UPDATE VENUE ALGO
// * EASY FIX

// http://www.burlingtonantiqueshow.com/
// - no location given, but if we update the venue algo as stated above we
//   can default the location to the venue.
// * NEED TWO FIXES ABOVE
// * EASY FIX

// http://www.junkmarketstyle.com/item/195/burlington-antique-show
// - seems to be ok now

// http://www.queencityshows.com/tristate/tristate.html
// - July 3 & 4 is resulting in empty times but shouldn't be!
// * FIX INTERVAL COMPUTATION
// * EASY FIX

// http://www.thewomensconnection.org/Programs/Monthly_Meetups_For_Women.htm
// - need to alias non-inlined street address to its inlined equivalent
// * FIX ADDRESS ALIAS ALGO
// * EASY FIX

// http://preciousharvest.com/feed
// - rss content is not expanded... why? need to expand CDATA tags...
// * EASY FIX

// http://www.andersonparks.com/ProgramDescriptions/YoungRembrandtsSummerCamps.html
// - thinks event date is registration date since it is after a
//   "register now" link.
// - do not treat a date as registration hours if it is 2 or less hours like
//   1 - 2:30pm, because what box office is only open for a few hours?
// * EASY FIX

// abqcsl.org
// - the youth services tod range was telescoping to "Sunday" when we had
//   an exception in isCompatible() to fix folkmads.org, which allowed an
//   isolated tod section to telescope its tod to a section that already had
//   a tod. but really are the youth services on sunday? that does not
//   seem clear really...
// - 3/14/10 should telescope to the store hours, but because a brother
//   section has a tod "Oct 18, 1:15PM" it doesn't.
// - 3/14/10 is in a datelistcontainer so it can't be a header
// - it should not be included anyway because its title is outlinked
// - taking out the line in isCompatible() meant for peachpundit.com actually
//   seems to bring back the 3/14/10 telescoping to sunday hours event

// http://www.arniesonthelevee.com/
// - needs support for "all week" to get the store hours i think

// http://schools.publicschoolsreport.com/county/NM/Sandoval.html
// - misses santo domingo school because we do not recognize the city
//   "sn domingo pblo" which would inline the "I-25 & Hwy 301" intersection.
// - but the elementary school uses a "1" instead of an "I" for "I-25"!

// http://yellowpages.superpages.com/listings.jsp?CS=L&MCBP=true&search=Find+It&SRC=&C=bicycles&STYPE=S&L=Albuquerque+NM+&x=0&y=0
// - "2430 Washington St NE" misses latitude because it is not preceded by
//   a zero nor does it have a decimal point in it

// http://www.menuism.com/cities/us/nm/albuquerque/n/7414-south-san-pedro
// - has abq,nm BEFORE the street address
// - we only got it by luck before because the state was in the name2
//   and we were calling addProperPlaces on name1 and name2 ... and the
//   city abq was in the page title
// * WHAT TO DO? -- scan headers for abq nm??????

// http://www.collectiveautonomy.net/mediawiki/index.php?title=Albuquerque
// . misses event because it can not associate UNM with Abq, NM
// * NEED BETTER PLACE MAPPING

//. http://www.wholefoodsmarket.com/stores/albuquerque/
// - good titles
// - "STORES" at end should be a menu header but is not

//. http://www.switchboard.com/albuquerque-nm/doughnuts/
// - good titles
// - lost phone # in description when we ignored span/font tags. because
//   it is in a div hide tag.
// - thinks switchboard.com biz category line is a menu header now that
//   implied sections groups it with that...
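
// A minimal sketch of the "2 or less hours" registration rule suggested
// for andersonparks.com above: a tod range that short is assumed to be the
// event itself, not box office hours. mayBeRegistrationHours() and the
// minutes-since-midnight representation are assumptions made for this
// example, not the representation Dates.cpp uses.

```cpp
#include <cassert>

// Both times are minutes since midnight. Returns false when the range is
// two hours or shorter, so it should not be treated as registration hours.
bool mayBeRegistrationHours(int startMin, int endMin) {
    int duration = endMin - startMin;
    if (duration < 0) duration += 24 * 60; // range wraps past midnight
    return duration > 120;
}

// Small sanity checks for the rule.
void checkMayBeRegistrationHours() {
    assert(!mayBeRegistrationHours(13 * 60, 14 * 60 + 30)); // 1 - 2:30pm
    assert(!mayBeRegistrationHours(10 * 60, 12 * 60));      // exactly 2 hours
    assert(mayBeRegistrationHours(9 * 60, 17 * 60));        // 9am - 5pm
}
```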

// http://www.zvents.com/albuquerque-nm/events/show/88543421-the-love-song-of-j-robert-oppenheimer-by-carson-kreitzer
// - good titles
// - gets "Feed Readers (RSS/XML" as possible title
// - includes quite a bit of menu cruft, hopefully will fade out
//   with SEC_MENU... check for 2nd zvents.com url... (it does! see below)
// - we should get the actual title but we get "Other future dates...".
//   i guess we should give a bonus if matches the title tag?
// * BONUS IF MATCHES TITLE TAG

// http://www.when.com/albuquerque-nm/venues
// - getting the place name of the event and not the event name because
//   the unverified place name has the same title score because it is
//   not verified, and because it is to the left of the time, it is
//   preferred then.
// * NEEDS MORE PAGES SPIDERED (to verify the place names)

// http://www.zvents.com/albuquerque-nm/events/show/88688960-sea-the-invalid-mariner
// - gets "Feed Readers (RSS/XML" as possible title
// - "Other Future Dates & Times" title...
// * BONUS IF MATCHES TITLE TAG

// . http://texasdrums.drums.org/albuquerque.htm
// - alternating rows in table are all headers... we ignore these for now.
//   but do we need header identification or something to do right?
// - STRANGE TABLE HEADERS

//. http://www.usadancenm.org/links.html
// - seems ok, but the best titles are mostly lowercase around the times
//   and we are getting address-y titles for the most part now
// * NO CASE PENALTY IF SENTENCE INCLUDES EVENT DATE

//. facebook.com
// - gets "Full" and "Compact" as part of event description, but those are
//   options for the "View: ". so we need a special menu detector that
//   realizes one item in the list will not be a link because it is a
//   selection menu. then "View:" should be flagged as a menu header.
// - any link with a language name like "English (US)" should be
//   marked as SEC_MENU if in its own section and is a link.
// * NEED SELECTION MENU DETECTOR
// * IDENTIFY LANGUAGE LINKS AS SEC_MENU

// thingstodo.msn.com
// - best title is "Bird Walk" in a link, but we miss it. we get
//   "Upcoming Events" instead because it gets an inheadertag boost. but
//   if we spider enough pages i would think it would get a penalty from
//   being repeated on other different event pages.
// * NEEDS MORE PAGES SPIDERED

//. http://www.collectorsguide.com/ab/abmud.html
// - misses jonson gallery address because of no "new mexico" in title
// - misses atomic museum address for same reason
// - misses "Friday of every month at 1:30pm -- call for reservations"
//   because of SEC_HAS_REGISTRATION bit. how to fix?
// - good titles
// - "last modified: September 24, 2007" should be marked as a last mod
//   date by Dates.cpp and excluded completely in the min/max event id algo
// * IDENTIFY AND IGNORE LAST MODIFIED/UPDATED DATES AND SECTIONS
// * ADD META DESCRIPTION like we do titles for places to fix jonson gallery.

//. http://www.abqfolkfest.org/resources.shtml
// - american sewing guild is just in strong tags so is not its own
//   sentence, so the title algo breaks down there. but they might have
//   just as easily forgone the strong tags, then, how would we get the title?
//   i would say this is mostly title-less
// - "For questions or comments contact the webmaster" ???? dunno... SEC_DUP?
// - getting a Last Updated date in the event descriptions too
// - lost a title because of TSF_MIXED_TEXT
//   "Tango Club of Albuquerque (Argentine Tango)". should we split up the
//   sentence when it ends in a parenthetical to fix that? the new title
//   is now "DANCE" which is the generic header.
// * IDENTIFY AND IGNORE LAST MODIFIED/UPDATED DATES AND SECTIONS

//. http://www.unm.edu/~willow/homeless/services.html
// - a bad implied section giving us menu crap for the first few events
// - we get header cruft for every event, so we need implied sections to
//   bind the headers to the sections they head.
//   the headers are:
//   Family Health, Child Care, School Preparation, Food, Fathers,
//   Activities. i think they were bound with the font tags which we got
//   rid of.
// - for "Tue. - Fri. 9 am. - 11 am" title we are missing the event
//   address in the description... what's up with that?
//   101 broadway does not have address as a title candidate... wtf? was
//   that on purpose?? no, the other events have address as title candidates
//   misses "Noon Day Ministry" as title...
// - missed "Closed the 1st and 15th of each month;"
// - recognize "(no Thurs)" as except thursday.
// - treat "Fri. pm." as "Friday night"
// - missing "801 mountain" event... why?
// * BETTER IMPLIED SECTIONS

// http://events.mapchannels.com/Index.aspx?venue=628
// - pretty good. has a little menu cruft, but not too bad.

// http://www.salsapower.com/cities/us/newmexico.htm
// - IGNORE WEBMASTER BLURBS (contact webmaster, webmaster/design...)
// - combine copyright, webmaster, advertising blurbs at the end into
//   a tail section and ignore...
// - "interested in advertising with us..." part of tail and probably
//   would have high SV_DUP score relative to the rest of the scores.
// - getting "Instructores" in description of Cooperage event because
//   it is an isolated header with no elements beneath it, other than
//   the other header "Santa Fe", which is a header of an implied section.
//   i mentioned this below and called it the double header bug.
// * DOUBLE HEADER BUG

// http://www.americantowns.com/nm/albuquerque/events/abq-social-variety-dances-2009-08-22
// - lost event because i guess we added a delimiter-based implied section to
//   split the two tod ranges into two different "hard" sections.
// - perhaps not EVERY dance is held at abq sw dance center, so maybe it is
//   a good/safe thing that we do not get that event any more.
// - old comments:
//   - title is good
//   - event description has some menu cruft in it:
//   - getting view by date, view by timeframe, view by category list menu
//     headers in event description
//   - has some real estate agent headers which are not seen as a menu
//     header because it only has one link in its menu
//   - has navigation links "Add Your business or group" which
//     are not 100% in a link, but they are in a list where each item in that
//     list does have a link in it, maybe make that exception to the SEC_MENU
//     algo, that if the section does contain link text it is acceptable,
//     even if it also contains plain text.
//   - lone link "See All Cities in New Mexico". how to fix?
// * SUPPORT FOR SINGLE LINK HEADER IDENTIFICATION

// http://www.ceder.net/clubdb/view.php4?action=query&StateId=31
// - titles and descriptions seem pretty good.

// http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11
// - titles and descriptions seem pretty good.

// http://www.meetup.com/Ballroom-Dance-in-Albuquerque/
// - has a list of languages (language menu)
// - has a trademark blurb "trademarks belong to their respective owners"
// - has a "Read more" link that goes to another page at end of event desc.
// * LANGUAGE MENU

// http://www.abqtango.org/current.html
// - has one bad title because case is bad:
//   "Free introductory Argentine Tango dance class" and ends up getting
//   less good titles.
// - misses another good title because it has "business district" in
//   lower case when it shouldn't really.
// - so we are missing some good titles because of our case penalty...
//   perhaps we should not do that if the sentence includes the event date???
// * NO CASE PENALTY IF SENTENCE INCLUDES EVENT DATE

// http://www.sfreporter.com/contact_us/#
// - good title "business hours" now
// - has some menu cruft
// - has a "search" section with a bunch of forms and we get the form
//   headers in the event description
// * FORM TAG HEADER DETECTION

// http://pacificmedicalcenters.org/index.php/where-we-are/first-hill/
// - good titles
// - get some doctor's names that were not labeled as SEC_MENU because
//   they were by themselves in the list. how to fix?
// * SUPPORT FOR SINGLE LINK HEADER IDENTIFICATION

// http://www.santafeplayhouse.org/onstage.php4
// - bad implied sections for Ticket Price header etc. but we still get the
//   correct dates though
// . later we should probably consider doing a larger partition first
//   then partitioning those larger sections further. like looking
//   ahead a move in a chess game. should better partition
//   santafeplayhouse.org methinks this way.
// - give bonus points if implied section ends on a double br tag?
// - bad titles...
// - penalizing "Performance Dates:" because it has a colon, even
//   though it is a header for a list of brothers. maybe do not penalize
//   under such conditions. this would fix the "pay-what-you-wish" title too!
// - getting bad title "Pay-what-you-wish" which is actually a "price" in
//   the ticket prices table. maybe we should penalize event titles in
//   registration sections? or treat it as "free" (h_free in Events.cpp)
//   so we think of it as another price point. or count it for "dollarCount"
//   in Events.cpp.
// * NO HAS_COLON PENALTY if is header of a list of things

// realtor.com
// . both urls have the lat/lon twice, but the first pair misses the negative
//   sign in front of the lon and therefore it throws our whole lat/lon algo
//   out of sync and we miss the next lat/lon pair which is the real deal

// new event urls to do:

// http://www.weavespindye.org/?loc=8-00-00
// - no tod so no events
// - has no addresses
// - has one iframe, we support it

// http://www.thewoodencow.com/
// - we get store hours as events, but has unrelated events in description
//   because it is talking about things going on, but with no dates, and
//   only a "read more" link for each thing.
// * REMOVE UNRELATED BLURBS FROM EVENT DESCRIPTIONS ("read more links")
// * REMOVE SINGLE LINKS ("Subscribe (RSS)" link) from desc.
// * REMOVE WEBMASTER BLURB ("Office Space theme by Press75.com") from desc.

// http://www.thewoodencow.com/2010/07/19/a-walk-on-the-wild-side/
// - similar to root url
// * REMOVE SINGLE LINKS ("Subscribe (RSS)" link) from desc.
// * REMOVE WEBMASTER BLURB ("Office Space theme by Press75.com") from desc.

// http://www.adobetheater.org/
// - seems to be ok. got two event dates.

// http://villr.com/market.htm
// . made an exception in isCompatible() so the isolated month/day dates
//   can telescope to the store hours dates section even though that section
//   has month/day dates already.
// . if we later have to undo this fix, then put a fix in that since the section
//   has "every saturday" we should ignore its month/day and allow the
//   isolated monthdays below to telescope to it. obviously "every saturday"
//   is not referring to just one monthday...
// . NEED SUPPORT FOR "mid November"
// . NEEDS SUB-EVENT SUPPORT

// http://blackouttheatre.com/Blackout_Theatre/Upcoming_Productions.html
// . has "the box performance space" but could not find a default venue
//   address on the website, and could not link this space to Abq, NM
// . NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE (by inlinkers?)

// http://vortexabq.org/
// - pretty hardcore
// - calls javascript to open the real content though and we need to support
//   that: http://vortexabq.org/ProdnProcessing.php
// - has "reqa.open("GET","ProdnProcessing.php");" and we need that file
// - misses address: 2004½ Central Ave. SE, Albuquerque, NM 87106
//   but might be a copyright address
// * DOWNLOAD JAVASCRIPT IN FUNCTIONS
// * SUPPORT ½ in addresses

// http://folkmads.org/special_events.html
// - misses little sub tod ranges because of the rule:
//   "if ( (acc1 & acc2) == acc2 ) return false" because the header date
//   itself already has a tod range so it doesn't care about our tod range.
//   how to fix?
// - i added an exception at the end of isCompatible() to allow the isolated
//   tods to telescope to the July date, but it was causing the pubdate tod for
//   piratecatradio.com to telescope to the play time and address, so until we
//   somehow are sure the tod is not a pubdate tod we have to leave this out
// - misses location "abq square dance center" has no city/state to pair with
// - we miss o neil's pub why? we can assume new mexico since that is in
//   the title. then we need to be able to look up a place name with no
//   city and just a state...
// * IF "ABQ" is in PLACE NAME, ASSUME CITY IS ABQ for placedb lookup
// * NEED TO IDENTIFY THE DEFAULT CITY/STATE of A WEBSITE (by inlinkers?)
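
// A minimal sketch of why the "(acc1 & acc2) == acc2" rule quoted for
// folkmads.org above drops the sub tod ranges. It assumes acc1 is the
// accumulated date-type mask of the header date and acc2 is the mask of
// the date trying to telescope to it; the notes do not say which is
// which. The DT_* bits and canTelescopeTo() are hypothetical names for
// this example and are not the flags used in Dates.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical date-type bits.
enum DateTypeBits : uint32_t {
    DT_TOD      = 1u << 0, // time of day or tod range
    DT_DOW      = 1u << 1, // day of week
    DT_MONTHDAY = 1u << 2  // month + day number
};

// acc1: types the header date already has (assumed).
// acc2: types of the date that wants to telescope up (assumed).
// If every type in acc2 is already present in acc1, the header gains
// nothing from this date, so the pairing is rejected.
bool canTelescopeTo(uint32_t acc1, uint32_t acc2) {
    if ((acc1 & acc2) == acc2) return false;
    return true;
}

// Small sanity checks for the rule.
void checkCanTelescopeTo() {
    // header "July 10, 8pm-11pm" already has a tod: the sub tod is rejected
    assert(!canTelescopeTo(DT_MONTHDAY | DT_TOD, DT_TOD));
    // header with only a month/day accepts the tod
    assert(canTelescopeTo(DT_MONTHDAY, DT_TOD));
}
```

// Under these assumptions, any exception for isolated tods has to sit in
// front of the mask test, which matches the note that it was added to
// isCompatible() and then removed because of the pubdate tod problem.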

// http://abqfolkdance.org/
// - misses a few tod range only sub-events because they are in an
//   SEC_TOD_EVENT section i guess, or the telescopes fail because of the acc1
//   algo... but even if in a separate hard section, we should allow the
//   tod range to telescope to saturday nights if our section is only
//   tods and tod ranges perhaps???
//   "dancing begins at 8:15 and ends around 10:30."
// - the TOD ranges in the second section are sub times of the
//   first section, so they should include the first section in their
//   event description. we are using his address, right???
// * ADD "ENGLISH" TOD RANGES
// * SUPPORT FOR SUB EVENTS
// * SUPPORT SPECIAL RANGES: "begins around|at 8:15 and ends around|at 10:30"

// http://newmexicojazzfestival.org/
// . is getting the box office hours as events. add to registration keywords.
// * ADD MORE REGISTRATION KEYWORDS
// * SPIDERED DATE is IN JAN 2010

// www.newmexicomusic.org/directory/index.php?content=services&select=529
// . lost event because it is in the same sentence as "box office" because
//   the author forgot to put a period in there to separate them into two
//   different sentences!
// . "Call the box office for program information: 888.818.7872 or go online
//   at www.spencertheater.com Free public tours are offered at 10 a.m. on
//   Tuesdays and Thursdays throughout the year."
// * BETTER SENTENCE DETECTION

// http://sybarite5.org/upcoming.htm
// - got "January, October, December 2010" as a header because its datebrother
//   bit was not set because it was at the top of the brother list. false
//   date header caused us to lose some events.
// - support NYC for address like "338 West 23rd St. NYC"
// - grabbing part of an event description from something that seems like
//   it should be paired up with an implied section with the date above it:
//   "Piotr Szewczyk The Rebel..."
//   should be paired up with
//   "January 22,23 & 24 2010- 8:00pm" or AT LEAST in its own SEC_TOD_EVENT
//   section to prevent it from being used as a description for the
//   event with the date "July 24th 2010 7:30pm"
// - event description has another brother event desc in it... why? isn't
//   the EV_TOD_EVENT working for this???
// - NYC should be recognized as NY,NY
// * BAD EVENT DESCRIPTION
// * NEEDS MORE IMPLIED SECTIONS

// http://corralesbosquegallery.com/
// - seems to be ok. gets the store hours.

// http://web.mac.com/bdensford/Gallery_website/Events_Calendar.html
// - the above website's events...
// - seems pretty good

// http://villr.com/market.htm
// - event description sentence mess up? "Los Ranchos Growers' and [[]] ..."
// - misses some parts of the event description because of SEC_TOD_EVENT
//   section flags. but really the brother sections that caused that were
//   actually subevents of the main date, although they did include a
//   month and daynum themselves and not a sub tod range as most sub-events
//   probably do.
// * SUPPORT FOR SUB EVENTS (month/daynum based)

// http://eventful.com/lawrenceburg/venues/lawrenceburg-fairgrounds-/V0-001-000208596-1
// - has address of lawrenceburg fairgrounds but only as an intersection
// * BETTER INTERSECTION ADDRESSES

// http://rodeo.cincinnati.com/f2/events/proddisplay.aspx?d=&prodid=3461
// - address has no street number "MainStrasse Village, Main Street
//   Covington, KY 41011"
// - placedb should index streets without their numbers but with zip codes
//   as if they were place names, like "Tom's Grill, Abq NM". but only
//   do that if we have a gps point to go with it.
// * INDEX STREET NAMES WITHOUT NUMBERS INTO PLACEDB

// http://www.scrap-ink.com/
// - all flash, can't parse it

// http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11
// - title of "Cost:" is bad because it precedes colon -70%
// - best title is "Beginning Square Dance Lessons, Albuquerque"
// - "disclaimer & use" and "Contact New Mexico Tourism Dept" should be
//   part of a menu! wtf? sentence flip flop?
// - we leave out the dollar sign '$' in one of the description sections for
//   the cost of the event since the section starts with that!
// - "More details about this meetup" probably a high SV_DUP and since it
//   starts with "more" and is in a link, will probably be excluded as a menu
//   link
// - sentence flip flop, "Promote!" should be SEC_MENU!
// - "Asst." should be in Abbreviations.h list so that "Asst. Organizers:"
//   will be just one section, and will have tiny title and desc. score since
//   it precedes a colon.
// - "Trademarks belong to their..." will have high SV_DUP count and therefore
//   minimal title and desc. score.
// - language names in a list should have minimal title and desc score.
//   but probably no need to detect since SV_DUP will be high eventually.
// * for title score ties prefer one close to the event date with highest
//   m_a
// - i would exclude really high SV_DUP dup scores from the title/desc and
//   index to keep things clear. but we do want to have field names like
//   "Category" that label other non dup-ish content. so labels are ok, but
//   not stuff like "More details about this Meetup..." which has a high
//   SV_DUP count and is not a field name for anything.

// http://www.sfreporter.com/contact_us/
// - single store hours "event"
// - probably ok but sentence flip flop bug letting in menus?

// http://www.publicbroadcasting.net/kunm/events.eventsmain
// - lost the guild cinema address, but i do not see nm or "new mexico"
//   anywhere on the page, so even though albuquerque is right after
//   "the guild cinema", if we have no state name, we can't make it work...
// - SUPPORT CITIES WITH NO STATE NAMES SOMEHOW

//mdw left off here do. pacific medical... but fix other bugs first...

// http://www.publicbroadcasting.net/kunm/events.eventsmain?action=showCategoryListing&newSearch=true&categorySearch=4025
//   getting bad titles of "Date:"
//   need TSF_DATE_SECTION to penalize title score! so when a
//   section is a date only, do like x .05
// - need a .90 after colon penalty TSF_AFTER_COLON...

// reverbnation.com:
//   this is a toughy!!! we got a lower case title. we have
//   multiple bands which is ok, but we are getting categories
//   like "Latin" and "Bogota, CO" as a title. maybe discount
//   place names ...
// - for every repeated section tag hash, compute a global
//   average title score, and apply that to boost titles that
//   might be lower case like "kimo" is on this page. i.e. we
//   are voting on the best title sections. and we should also
//   use sectiondb for this as well as this local algo.
// - in the case of multiple events
// - if section has a prev or next brother with the same taghash
//   then probably give a "list" TLF_IN_LIST penalty for that
//   of like maybe .80, not too harsh...

// .. consider comparing content of sections where not any dup/nondup voting
//    info, compare to sections on other websites that do have adequate voting
//    info, and if similar, maybe use that voting info. might help us nuke
//    certain types of footers and headers... legal disclaimers, etc. brain
//    kinda works like this.

// ** in title tag, allow " - " to split a sentence section
// ** prefer the title that matches a section in the title tag then.

/*
  BUT what about burtstikilounge???
  all events are lists of links. i guess then we just need to rely on
  SEC_NOT_DUP????
  well kinda, the whole calendar would have SEC_NOT_DUP, but an individual
  cell of the table could have SEC_DUP and/or SEC_NOT_DUP!!

  to fix burts: take the list of links that we think is SEC_CRUFT_COMMON
  then look that up as a whole section and if SEC_NOT_DUP is set then
  do not set SEC_CRUFT on it otherwise set it !!! does that work?
  apply to renegade links as well?
*/

//
// missed events:
//

// http://www.zvents.com/albuquerque-nm/events/show/88688960-sea-the-invalid-mariner
//   two of the events now have non-outlinked titles. good. but
//   the second date's title is wrong.
//   SEA & the Invalid Mariner...
// * EV_OUTLINKED_TITLE casualty
// * BAD TITLE ("Date", ignore tags, SEC_CRUFT_DETECT bit)

// collectorsguide.com:
//   special subevent at jonson gallery starts at 5:30 but in
//   the next sentence, which actually applies to unm art gallery,
//   store hours are given up until 4pm, so this cancels out the
//   5:30pm and results in empty times. we could check to see if
//   the header is compatible before we add it???
// - use title expansion algo. should be ok since address will
//   be included and we should not set EV_OUTLINKED_TITLE.
// * BAD DATE HEADER ALGO
// * BAD TITLES (need full expansion algo)

// abqfolkfest.org
//   need to do to-brother title section expansion algo.
// * BAD TITLES (need full expansion algo)

// http://www.guildcinema.com/
//   one bad title.
//   when scanning to set the title in Events.cpp we start at
//   the first date in the telescope, however we should in this
//   case start at the 6pm to get the right title. maybe pick
//   the date with the highest word # to start at, unless it does
//   not have the smallest headerCount (i.e. unless it is used
//   in more telescopes as headers than another date)
// - set Date::m_headerCount in Dates.cpp at the end of the algo
//   just loop through the dates and set that count for all
//   Dates in a telescope not the first ptr.
// - so pick the date in the telescope with the highest m_a // unless its m_headerCount is not at the min. // - or would event deduping fix this? // * BAD TITLES (start scan @ highest m_a,min m_headerCount) // http://events.mapchannels.com/Index.aspx?venue=628 // using "Buy Tickets from $xx" as titles. i guess we need to // maybe look at the table column header for "Title"? // * BAD TITLES (add "Buy Tickets*" links to renegade // SEC_CRUFT list) // http://www.salsapower.com/cities/us/newmexico.htm // one title is "$5.00" without the $. maybe stop that. // skip titles that are just a price. // allow dates in titles if in same sentence as would be title. // that should change "with Darrin..." title to // "Tuesdays with Darrin". // "Class at" will change to "Class at 7 p.m." but it really // should be "...The salsa Dance Class at 7 p.m." but i guess // the br tag is breaking the sentence?? we probably need to // really improve our sentence detector to fix that right. // Cooperage event is getting Instructores header as part of // event description because of their double heading sections. // FIX by not taking descriptions from brother sections that are // isolated like that, when you contain its true brother in your // implied section. it is like a bodyless header brother. do not // get descriptions from those, maybe unless it is directly above // you, since it could be a double header, which is rare, but // that is what it is in this case. // * BAD TITLES (full expansion algo?) // * DOUBLE HEADING causing bad heading in event description // http://www.newmexico.org/calendar/events/index.php?com=detail&eID=9694&year=2009&month=11 // has just one event. // title we get is "Cost" and is below the date. we really need // to keep telescoping until we get text above at least one of // the dates in the telescope... so if we discover we have a bad // title then telescope until we got text on top of the lowest // date. try to first get the title before the date. 
if we // telescope up until we get text before the date, if all the // new section we get before the date is just a title section // looking thing (ignoring the SEC_CRUFT) then maybe that is // the best title. // * BAD TITLES (telescope until text above the date???) // http://www.patpendergrass.com/albnews.html // "saturday morning from 10:00 am - noon" is not telescoping // to "March 19, 2005" like it should... wtf? // * BAD DATE TELESCOPING // http://www.abqtango.org/current.html // one title is "New" so we should ignore that probably. // * BAD TITLES (need full expansion algo probably for others) // http://pacificmedicalcenters.org/index.php/where-we-are/first-hill/ // gets a couple titles wrong. full expansion would fix it. // * BAD TITLES (need full expansion algo probably for others) // http://www.santafeplayhouse.org/onstage.php4 // we do not realize that all these dates are talking about // one event really... so titles are not the best... // also do not parse an except/closed date correctly... // * BAD TITLES (???) // http://www.publicbroadcasting.net/kunm/events.eventsmain?action=showCategoryListing&newSearch=true&categorySearch=4025 // getting bad titles of "Date:" can be fixed with full exp algo. // * BAD TITLES (need full expansion algo) // http://www.dailylobo.com/calendar/ // bad title. only one event so can't maybe do full exp algo. // Title is "Offered". // * BAD TITLE (???) // http://www.burtstikilounge.com/burts/ // there really are no titles. // so we would just take the first item in a calendar day and // ignoring dates would find the title to be outlinked which // is probably a good thing. // however the store hours do not really have brothers so // maybe do not do full expansion on them??? // do not do the full expansion if we have a calendar page like // this because there are often multiple events per daynum... // * BAD TITLES (ignore daynums,...???) 
// http://upcoming.yahoo.com/event/4888173/NM/Albuquerque/Pet-Loss-Group/The-Source/ // single event. bad title of "Event Photos" which is really // SEC_CRUFT but we do not know it yet. // * BAD TITLE (???) // http://events.kqed.org/events/index.php?com=detail&eID=9812&year=2009&month=11 // has dup event. but really just one event. title is // "Cost:" which is wrong, and the true title is above the date. // consider telescoping until we get text above the date. // * BAD TITLE (telescope until text above the date?) // http://entertainment.signonsandiego.com/events/eve-selis/ // single event. // has title "When". really we need to identify and ignore // the menu cruft better. // * BAD_TITLE (title is "When", telescope til text above) // http://www.mrmovietimes.com/movie-theaters/Century-Rio-24.html // a title is bad, it is now the address of the place. // lost all events because their movie titles were outlinked. // but the movie "2012" survived because its title was bypassed // because D_IS_IN_DATE was set for it! // - try to fix with another site page to set SEC_NOT_MENU // * EV_OUTLINKED_TITLE casualty // * BAD_TITLE ("2012" [the movie]) // http://www.trumba.com/calendars/KRQE_Calendar.rss // - missed address "12611 Montgomery Blvd. NE, Suite A-4 in the // Glenwood Shopping Center" because city is not after or before it, // and i guess before when we did get this address, we had contact info // or something in abq. now i don't see contact info or a venue addr for // trumba, which is right... // - missed "Each weekly program is offered on Sunday at 10:30am with a // repeat on Wednesday at 6:00pm". was only getting them right before // we added the comboTable logic in Dates.cpp to get all date combos, // because of a fluke. really if Sunday and Wednesday were modified // by "every" or were plural then they would not be allowed to telescope // to the daynum/month date, which is causing them to be emptytimes. 
// - the other trumba.com url i think has a similar issue for the // "Transitioning Professionals..." events, which have meetings every // Tuesday, but the "every" is not right before the Tuesday, so we miss // that too. better safe than sorry! // mdw left off here // http://boe.sandovalcountynm.gov/location.html // missing address: // "960 FORREST RD 10 JEMEZ SPRINGS, NM 87025" // http://www.uniquevenues.com/StJohnsNM // missing address: // "Colorado Office: 225 Main St, Opal Bldg, G-1 Edwards, CO" // does not like the "suite" in between street and city. // http://eventful.com/albuquerque/venues/the-filling-station-/V0-001-001121221-1 // before was protected by SEC_NOT_MENU logic, but now we had to // remove that since SEC_NOT_MENU logic is not reliable. // * EV_OUTLINKED_TITLE casualty // http://events.kgoradio.com/san-francisco-ca/venues/show/4834-davies-symphony-hall // really it is getting bad titles now and should not have // any events since they are all outlinked titles... // * EV_OUTLINKED_TITLE casualty // * BAD TITLES ("Hide") // http://www.zvents.com/albuquerque-nm/venues/show/11865-kimo-theatre // this lost all its events except the store hours, which is // expected behavior now. // * EV_OUTLINKED_TITLE casualty // http://www.when.com/albuquerque-nm/venues // all its events had outlinked titles and it lost them all. good. // * EV_OUTLINKED_TITLE casualty // http://events.kgoradio.com/san-francisco-ca/events/show/88047269-san-francisco-symphony-chorus-sings-bachs-christmas-oratorio // two of the events now have non-outlinked titles. good. but // the second and third dates' titles are wrong. // * EV_OUTLINKED_TITLE casualty // http://events.sfgate.com/san-francisco-ca/venues/show/6136-exploratorium // all its events had outlinked titles and it lost them all. good. // * EV_OUTLINKED_TITLE casualty // http://events.sfgate.com/san-francisco-ca/events/show/88884664-solstice-seed-swap // all of its events but one were lost because of outlinked title. 
// this is good. // * EV_OUTLINKED_TITLE casualty // http://www.when.com/albuquerque-nm/venues/show/1061223-guild-cinema // all its events had outlinked titles and it lost them all. good. // * EV_OUTLINKED_TITLE casualty // http://www.reverbnation.com/venue/153991 // all its events had outlinked titles and it lost them all. good. // * EV_OUTLINKED_TITLE casualty // http://thingstodo.msn.com/albuquerque-nm/venues/show/1139187-rio-grande-community-farm // "Bird walk" event title was outlinked. // * EV_OUTLINKED_TITLE casualty // http://events.kgoradio.com/ // "New riders of the purple" was an outlinked title. // * EV_OUTLINKED_TITLE casualty // http://blackbirdbuvette.com/ // "Geeks Who Drink" outlinks to another website. // * EV_OUTLINKED_TITLE casualty // // http://events.kgoradio.com/san-francisco-ca/venues/show/4834-davies-symphony-hall // http://eventful.com/albuquerque/venues/sunshine-theater-/V0-001-001214224-7 // only miss events because they are EV_OUTLINKED_TITLE and // SEC_NOT_MENU is not set for them because this is the first // url we index from eventful.com // * EV_OUTLINKED_TITLE casualty // // http://www.smithsonianmag.com/museumday/venues/Albuquerque_Museum_of_Art_History.html // the museum hours have no associated days of week so Events.cpp // ignores them i guess. // * CLOCK DETECTION // * RESPIDER TESTBED // http://www.dukecityfix.com/events/shelter-space-place-belonging // has "808 park ave sw albuquerque" but no state. so we // should assume its abq NM since that is the only city with that // name. // * SUPPORT CITIES WITHOUT STATES // * NEW PLACEDB KEY // http://www.abqtango.org/current.html // misses some events because of those "Next Dates: ..." things // as well as not having city/state for addresses. // i hack fixed the April 2010 header problem by adding // "Details TBA" as an unknown location, but really we should // fix this right with implied sections. just need to section // out with implied sections BASED on the section content... 
// MORE problems of the same nature. now the 111 harvard event // is telescoping to the "Tues Sept 29 - Sun Oct 4" event. // . http://www.aliconferences.com/conf/social_media_govt1209/index.htm // it says "sign up for your choice of these events: ..." // so we thinkg the event times are registration times. // * REGISTRATION ALGO FIX // http://www.sdcitybeat.com/cms/location/place/stone_brewing_co/147/ // http://music.myspace.com/index.cfm?fuseaction=music.showDetails&friendid=55284962&Band_Show_ID=100037466 // misses the event date because is sets EV_COMMENT_DATE because // i don't have another page from that site that has the same // section tag hash for the event date because they changed their // template! // * NEED MORE DATA // http://www.abqtrib.com/news/2007/may/15/horse-therapy-gives-people-disabilities-opportunit/ // hours dates have bad telecsope: // "Monday 5 to 6 p.m., 6:15 to 7:15 p.m [[]] Jan. 17, 2008" // where jan. 17, 2008 is a date in a link to an article, pets // of the week: jan 17, 2008!! // * do not telescope to dates that are basically clock dates // and SEC_FUTURE_DATE is not set...??? // // http://www.graffiti.org/index/history2008.html // - gets "Cody Hudson" in event desc for gallery hours because the // todSec algo in Events.cpp distributes it to all events beneath it. // - gets the "Tues - Sun..." store hours because it is in its own todevent // section so EV_BAD_STORE_HOURS does not get set. and the other gallery // store hours i guess do not have well defined addresses. // - "Gallery hours are Wednesday through Saturday 11:00 AM - 6pm" // doesn't telescope to "now - November 29, 2008" because we don't // understand "now". so we get some store hours from 2008. // - streets like "60 Avant-Garde Urban Contemporary Female Artists" // - streets like "3 Espacios" // - streets like "4 Barack" // we need partition detection here. seems like all events are // unclear on which addresses belong to them! 
// - misses "opens on Dec 20th and runs through Feb 7th". needs to // support "runs" so we can make that date a single range. // http://santafe.org/perl/page.cgi?p=maps;gid=2415 // - miss "valentine's day WEEKEND" // - we miss 704 camino lejo, because it does not end in a street indicator. // so we think it is just a regular name of something. // - gets wrong dates too, because one event is between dates of another one. // and we do not see Meem Library as a place name for some reason. // * SUPPORT " WEEKEND" // * GET "704 Camino Lejo" by using tigerdb or recognizing "Camino" like // how we do with "Paseo" // http://www.santafebotanicalgarden.org/mainpages/R_Resources.html // lost address from losing contact info. many just street names // with "Santa Fe" in the section header. maybe we should allow // section headers to be used in addProperPlaces() but then we'd // also need to assume the state is "New Mexico"! but in this // case we also have the place name present, so we could safely // try any city/state combo since we have place name! if we have // the place name and the street address, we should look that up // in placedb as another key!!! might fix christinesaari.com too // * NEW PLACEDB KEY // // santafeplayhouse // has "Santa Fer Playhouse is located at 142 east de vargas // street..." and no city/state! // * NEW PLACEDB KEY // http://www.christinesaari.com/html/news.php?psi=37 : // - does not get "The Kosmos" address because we need to allow // it to use "downtown Albuquerque" and "New HAVEN" // - we need to allow the whole doc to be scanned for states!!! // * SCAN WHOLE DOC FOR STATES // http://www.usadancenm.org/links.html : // without contact info page we miss address: // "111 Maple Street SE (at Central), ABQ" i guess because of // no state. // might be ok if we could lookup name and street as placedb key. 
// * NEW PLACEDB KEY // http://www.parkingcarma.com/parking_lots/401-MAIN-ST_San-Francisco/26ecdbcc-b80b-dc11-bcd7-0013723eb578/ // lost address from losing contact info. // might be ok if we could lookup name and street as placedb key. // * NEW PLACEDB KEY // http://www.lasg.org/waste/richardson-letter.htm // bad address formation: // "1807 Second St #31\nSF 87505" (SF = Santa Fe) // might be ok if we could lookup name and street as placedb key. // * NEW PLACEDB KEY // address/event loss because of no contact info: // http://www.xeriscapenm.com/xeriscape_gardens.php // http://www.abqtrib.com/news/2007/may/15/horse-therapy-gives-people-disabilities-opportunit/ // http://www.collectorsguide.com/ab/abmud.html (jonson gallery no city/state) // http://www.trumba.com/calendars/albuquerque-area-events-calendar.rss // (fellowship hall has no city/state) // http://obits.abqjournal.com/obits/2004/04/13 (many without city/state) // panjea.org : romy keegan can't telescope to the "store hours" date // because he's in a strange section and the store hours date // section contains all his date type (acc1/acc2 algo). but if // we weaken isCompatible() for store hours then unm.edu url // telescopes the list of store hours it has all over and that // messes up. // - now we miss out all events on panjea.org because if one // sibling contains its own location we assume that every // sibling must. this fixes santafe.org which has some events // in which we do not recognize the location and ended up // telescoping up the the college address header which was the // wrong thing to do. really we need to get better at location // identification. // * BAD DATE HEADER ALGO // * NEED BETTER LOCATION IDENTIFICATION // svrocks.com: "Great American Music Hall" is not made into an address because // we have no convenient city/state to tie it to. but // Silicon Valley is in the title tag... wtf? 
// * NO CITY STATE SPECIFIED // http://girlsintech.net/conference2010/ // we have a place name and street address but no city/state. // * NEW PLACEDB KEY // http://www.calumetphoto.com/p/events // events in a frame, but robots.txt disallows the frame url! // http://socialfresh.com/tampa/ // now we get the two addresses in frames. // www.marinmommies.com/create-ultimate-gingerbread-house has the bay area // discovery museum, but has no city/state to make an address from it to even // lookup in placedb. SOLUTION: identify dominant city/state of all events // and assume ALL the dominating city/state pairs mentioned on the site in // order to generate addresses. // * NO CITY STATE SPECIFIED // https://www.signmeup.com/site/reg/register.aspx?fid=N42V6K7 // - seems somewhat ok now // http://chamberdailydose.blogspot.com/2009/11/downtown-holiday-fest-begins-tomorrow.html // has "downtown fort wayne" and "tomorrow evening at 6". right // now we do not get any events because of that. // * VAGUE PLACE - downtown fort wayne // * TOMORROW - relative to pub date // http://nightlifegay.blogspot.com/2009/11/its-all-pink-tomorrow-with-gifts.html // has street intersections, and no city/state with them either. // it does mention "in Philadelphia" so we need to support that // and we can use that to turn the street intersections into // addresses in Philly, PA. also uses "tomorrow night" as the // daynum date. // * STREETDB - map street intersections to a city/state // * INTERSECTIONS // * TOMORROW - relative to pub date // http://www.atlantamusicblog.com/news/2009/11/win-tickets-to-hightide-blues-with-death-on-two-wheels-and-tesla-rossa-at-smiths-olde-bar.html // doesn't specify a tod, just says "turkey eve". // * NO TOD - just says "eve" // http://www.washingtonpost.com/wp-dyn/content/article/2009/11/20/AR2009112004036.html // place have streets and various city names, but no adm1s/states. // we can maybe make Address classes using the place name // the given ... how to fix???? 
// * STREETDB - streets have cities but no states...
// * STREETDB - one street has no city or state
// * INTERSECTIONS - Church Road and Webster Street NW
// * AREA CODES - map area codes to a city/state as well
// * ??? - tods are relative to the pub date, but do not say today

// http://www.stltoday.com/blogzone/the-blender/the-blender/2009/11/concert-announcement-kenny-rogers-at-family-arena/
// Has "Family Arena" as the place, but no city/state combos
// for it to glom onto. the newspaper is st louis based and that
// is in the contact info... BUT the arena is in St. Charles
// Missouri!
// * NO CITY STATE SPECIFIED

// http://www.gwair.org/Calendar.html
// we associate an event time with the wrong place because
// we do not recognize its location since it is a street
// intersection
// * BAD SPAN TAGS -- title/tod of one event is included in
// span tag of the other event, and so it misses out on
// its address, which is an intersection. "1st Sundays, 4:00pm"
// is the event, and that date is in the event above its
// span tag!! wtf... bad html...

// http://www.lacrossecenter.com/currentevents.aspx?vm=0&month=12&year=2009&lngCalendarID=14,19
// i don't know how to parse this one!
// * BEATS ME

/*

---- just if there are multiple candidates of the same date type then do
not select any of them in the telescope algo.

we then need a delimiter based algo to pick headers that are basically at
the same level as the stuff the dates "under" them.

if when telescoping you encounter the same taghash for the current
section THEN stop and store a -1 in there to put a halt to all of them.
this assumes we have virtual sections implemented.

no, just don't allow lists of guys to telescope past their current
section no matter what. or maybe you can telescope up until you hit a
section that already has your date type. so the 2010-2037 list of years
would stay limited to the parent section. and anyone telescoping up to
that would take the header date right above them. but for that list of
years 2010-2037 nobody would be able to telescope to them.

and for the table that has month/year rows intermingled with event rows,
all of the same tag hash, it would work too!

tr - Nov 2009
tr - 11/13
tr - 11/16
tr - Dec 2009
tr - 12/5
tr - 12/15

BUT other dates can telescope up to their current section.

the virtual sections would fix http://www.dailylobo.com/calendar/ =
http://10.5.1.203:8000/test/doc.18080536074677915848.html

hmmm what about using delimiters for events then??? can we make
delimiter based sections??? in Sections.cpp?? yeah, then we could use it
for events. just look for repeated alternating sections. similar to
compression tech?

look for section tag hashes that have the same total occNum count and are
adjacent... then couple all of them together as a virtual section.

do a linear scan down each section. look at sections adjacent to it by
going to its m_b and getting that section ptr. skip that ptr if it's a
parent. get the next sibling after it that is in the EXACT same section
it is in. stop if we leave that section however. then for each adjacent
sibling count its taghash. stop when we leave the parent section. then
look at the counts. get the smallest count. that is how many virtual
sections we have. divide all taghash counts by that min count. repeat the
scan again... but this time when we hit the divided count for all
taghashes store what we got as a virtual section ... and continue doing
that to get all virtual sections.

-- todo -- might be an alignment issue... check out later

*/

// TODO:
// support every monday, or every third monday , ...
//
// TODO:
// now for a given clock hash is it possible that some pages
// use that section for a clock, and other pages do not? let's
// wait and see before we do anything about that.
//
// TODO:
// . make a whole new set of urls for pub date detection
// . grab that sample set from buzz wiki page
// . record the correct pub date for urls in the "qatest123" coll and make sure
//   we get them each time, otherwise core dump!!
// . check the date we extract with the rss feed. that is a good test too!
//   report on that accuracy in the logs and on the stats page.

// . TODO:
// mark the time hours that are paired up with a date
// then pair up the remaining times with the closest unpaired dates
// http://byekoolaidmoms.blogspot.com/2006/11/counting-down.html

// . TODO:
// look at redir url for pub dates too! pass in firstUrl and redirUrl
// from XmlDoc.cpp

// . TODO:
// support partially split dates. year&month in url, month&day in body:
// http://www.semaphoria.com/james/blogger_archives/2004/01/warning-liberal-political-ramblings-to.html

// . TODO:
// support american/european format detection:
// http://nietsvoormij.web-log.nl/nietsvoormij/2007/02/wat_nou_saai.html

// . TODO: what to do when 25 hour respider fails to turn up any new info
// regarding american/european format?

// . TODO:
// consider age of page to be when the link was added to the root page.
// since we respider roots very frequently we can determine pretty well.
TODO: // http://harpers.org/archive/2008/12/hbc-90004012 #include "Dates.h" #include "gb-include.h" #include "fctypes.h" #include "Log.h" #include "HashTableX.h" #include "XmlDoc.h" #include "Abbreviations.h" // isAbbr() #define HD_NEW_YEARS_DAY 1 #define HD_MARTIN_DAY 2 #define HD_GROUNDHOG_DAY 3 #define HD_SUPERBOWL 4 #define HD_VALENTINES 5 #define HD_PRESIDENTS 6 #define HD_ASH_WEDNESDAY 7 #define HD_ST_PATRICKS 8 //#define HD_VERNAL_EQUI 9 #define HD_PALM_SUNDAY 10 #define HD_FIRST_PASSOVER 11 #define HD_APRIL_FOOLS 12 #define HD_GOOD_FRIDAY 13 #define HD_EASTER_SUNDAY 14 #define HD_EASTER_MONDAY 15 #define HD_LAST_PASSOVER 16 #define HD_PATRIOTS_DAY 17 #define HD_EARTH_DAY 18 #define HD_SECRETARY_DAY 19 #define HD_ARBOR_DAY 20 #define HD_CINCO_DE_MAYO 21 #define HD_MOTHERS_DAY 22 #define HD_PENTECOST_SUN 23 #define HD_MEMORIAL_DAY 24 #define HD_FLAG_DAY 25 #define HD_FATHERS_DAY 26 #define HD_SUMMER_SOL 27 #define HD_INDEPENDENCE 28 #define HD_LABOR_DAY 29 #define HD_YOM_KIPPUR 30 #define HD_LEIF_ERIKSON 31 #define HD_COLUMBUS_DAY 32 #define HD_MISCHIEF_NIGHT 33 #define HD_HALLOWEEN 34 #define HD_ALL_SAINTS 35 #define HD_VETERANS 36 #define HD_THANKSGIVING 37 #define HD_BLACK_FRIDAY 38 #define HD_PEARL_HARBOR 39 #define HD_ENERGY_CONS 40 #define HD_WINTER_SOL 41 #define HD_CHRISTMAS_EVE 42 #define HD_CHRISTMAS_DAY 43 #define HD_NEW_YEARS_EVE 44 // a delimiter used below -- these are certain types of holidays //#define HD_SPECIFIC_HOLIDAY_MAX 44 #define HD_EVERY_DAY 45 #define HD_SUMMER 46 #define HD_FALL 47 #define HD_WINTER 48 #define HD_SPRING 49 #define HD_WEEKENDS 50 #define HD_WEEKDAYS 51 #define HD_HOLIDAYS 52 #define HD_MORNING 53 #define HD_AFTERNOON 54 #define HD_NIGHT 55 #define HD_MONTH_LAST_DAY 56 #define HD_MONTH_FIRST_DAY 57 #define HD_EVERY_MONTH 58 #define HD_SCHOOL_YEAR 59 #define HD_TTH 60 #define HD_MW 61 #define HD_MWF 62 static int64_t h_funeral; static int64_t h_mortuary; static int64_t h_visitation; static int64_t h_memorial; static 
int64_t h_services; static int64_t h_service ; static int64_t h_founded; static int64_t h_established; static int64_t h_seniors; static int64_t h_a; static int64_t h_daily; static int64_t h_sunday; static int64_t h_monday; static int64_t h_tuesday; static int64_t h_wednesday; static int64_t h_thursday; static int64_t h_friday; static int64_t h_saturday; static int64_t h_mon; static int64_t h_tues; static int64_t h_tue; static int64_t h_wed; static int64_t h_wednes; static int64_t h_thurs; static int64_t h_thu; static int64_t h_thr; static int64_t h_fri; static int64_t h_sat; static int64_t h_details; static int64_t h_more; static int64_t h_to; static int64_t h_and; static int64_t h_or; static int64_t h_sun; static int64_t h_next; static int64_t h_this; static int64_t h_children; static int64_t h_age; static int64_t h_ages; static int64_t h_kids; static int64_t h_toddlers; static int64_t h_youngsters; static int64_t h_grade; static int64_t h_grades; static int64_t h_day; static int64_t h_years; static int64_t h_continuing; static int64_t h_through; static int64_t h_though; // misspelling static int64_t h_thru; static int64_t h_until; static int64_t h_til; static int64_t h_till; static int64_t h_ongoing; static int64_t h_lasting; static int64_t h_runs; // and runs through static int64_t h_results; static int64_t h_nightly; static int64_t h_lasts; // and lasts through static int64_t h_at; static int64_t h_on; static int64_t h_starts; static int64_t h_begins; static int64_t h_between; static int64_t h_from; static int64_t h_before; static int64_t h_after; static int64_t h_ends; static int64_t h_conclude; static int64_t h_concludes; static int64_t h_time; static int64_t h_date; static int64_t h_the; static int64_t h_copyright; static int64_t h_non ; static int64_t h_mid ; static int64_t h_each ; static int64_t h_every ; static int64_t h_first ; static int64_t h_second ; static int64_t h_third ; static int64_t h_fourth ; static int64_t h_fifth ; static int64_t h_1st; 
static int64_t h_2nd; static int64_t h_3rd; static int64_t h_4th; static int64_t h_5th; static int64_t h_1; static int64_t h_2; static int64_t h_3; static int64_t h_4; static int64_t h_5; static int64_t h_of ; static int64_t h_year ; static int64_t h_month ; static int64_t h_week ; static int64_t h_weeks ; static int64_t h_days; static int64_t h_months; static int64_t h_miles; static int64_t h_mile; static int64_t h_mi; static int64_t h_km; static int64_t h_kilometers; static int64_t h_kilometer; static int64_t h_night ; static int64_t h_nights ; static int64_t h_evening ; static int64_t h_evenings ; static int64_t h_morning ; static int64_t h_mornings ; static int64_t h_afternoon ; static int64_t h_afternoons ; static int64_t h_in ; static int64_t h_hours ; static int64_t h_are ; static int64_t h_is ; static int64_t h_semester ; static int64_t h_box ; static int64_t h_office ; static int64_t h_during ; static int64_t h_closed ; static int64_t h_closure ; static int64_t h_closures ; static int64_t h_desk; static int64_t h_reception; static int64_t h_st ; static int64_t h_nd ; static int64_t h_rd ; static int64_t h_th ; static int64_t h_sundays; static int64_t h_mondays; static int64_t h_tuesdays; static int64_t h_wednesdays; static int64_t h_thursdays; static int64_t h_fridays; static int64_t h_saturdays; static int64_t h_summers ; static int64_t h_autumns ; static int64_t h_winters ; static int64_t h_noon ; static int64_t h_midnight ; static int64_t h_midday ; static int64_t h_sunset ; static int64_t h_sundown ; static int64_t h_dusk ; static int64_t h_sunrise ; static int64_t h_dawn ; static int64_t h_s; static int64_t h_last; static int64_t h_modified; static int64_t h_posted; static int64_t h_updated; static int64_t h_by; static int64_t h_festival; static int64_t h_register; static int64_t h_registration; static int64_t h_phone; static int64_t h_please; static int64_t h_call; static int64_t h_us; static int64_t h_anytime; static int64_t h_be; static int64_t 
h_will; static int64_t h_sign; static int64_t h_up; static int64_t h_signup; static int64_t h_tickets; static int64_t h_advance; static int64_t h_purchase; static int64_t h_get; static int64_t h_enroll; static int64_t h_buy; static int64_t h_presale ; static int64_t h_pre ; static int64_t h_sale ; static int64_t h_sales ; static int64_t h_end ; static int64_t h_begin ; static int64_t h_start ; //static int64_t h_closed ; static int64_t h_closes ; static int64_t h_close ; static int64_t h_except ; static int64_t h_open ; static int64_t h_opens ; static int64_t h_happy; static int64_t h_kitchen; static int64_t h_hour; static int64_t h_m; static int64_t h_mo; static int64_t h_f; static int64_t h_late; static int64_t h_early; static int64_t h_since ; static int64_t h_rsvp ; static int64_t h_checkin ; static int64_t h_checkout ; static int64_t h_check ; static int64_t h_out ; static int64_t h_deadline ; static int64_t h_am; static int64_t h_pm; // . record the event times out this many days from the current date. // . make i 12 months out 365... but need to fix min/maxpubdate first #define DAYLIMIT (8*30) static bool isMonth ( int64_t wid ) ; static char getMonth ( int64_t wid ) ; static bool printDateElement ( Date *dp , SafeBuf *sb , Words *words , Date *fullDate ) ; Date **g_dp2 = NULL; Dates *g_dthis = NULL; Dates::Dates() { m_numPools = 0; m_maxDatePtrs = 0; m_ttValid = false; m_tntValid = false; m_bodySet = false; reset(); } Dates::~Dates() { reset(); } #define POOLSIZE 32000 void Dates::reset ( ) { // free pool mem for ( int32_t i = 0 ; i < m_numPools ; i++ ) { mfree ( m_pools[i] , POOLSIZE , "datemempool" ); // to be safe m_pools[i] = NULL; } m_numPools = 0; m_current = NULL; // reset count m_numDatePtrs = 0; m_numTotalPtrs = 0; m_url = NULL; // free that mem too if ( m_maxDatePtrs ) { mfree ( &m_datePtrs[0],m_maxDatePtrs*8,"pmem"); m_maxDatePtrs = 0; } // we have no "best" pub date right now m_best = NULL; m_pubDate = -1; // . 1 means american // . 
	// . 2 means european
	// . -1 means unknown
	m_dateFormat = 0;
	m_niceness = MAX_NICENESS;
	m_changed = 0;
	//m_urlDate = -1;
	//m_urlDateNum = -1;
	m_urlYear = 0;
	m_urlMonth = 0;
	m_urlDay = 0;
	m_firstGood = -1;
	m_lastGood = -1;
	m_siteHash = 0;
	m_badHtml = false;
	m_needQuickRespider = false;
	m_phoneXorsValid = false;
	m_emailXorsValid = false;
	m_todXorsValid = false;
	m_dayXorsValid = false;
	m_priceXorsValid = false;
	m_sftValid = false;
	m_dateBitsValid = false;
	m_current = NULL;
	m_currentEnd = NULL;
	m_overflowed = false;
	m_tids = NULL;
	m_wids = NULL;
	m_shiftDay = 0;
	m_setDateHashes = false;
	m_sections = NULL;
	m_dateFormatPanic = false;
	m_calledParseDates = false;
	m_bodySet = false;
}

// returns NULL with g_errno set on error
Date *Dates::getMem ( int32_t need ) {
	// sanity check. once we overflow, forget it! you should stop!
	if ( m_overflowed ) { char *xx=NULL;*xx=0; }
	// just use multiple pools
	if ( m_current + need <= m_currentEnd ) return (Date *)m_current;
	// sanity check
	if ( need > POOLSIZE ) { char *xx=NULL;*xx=0; }
	// sanity
	if ( m_numPools+1 > MAX_POOLS ) {
		// this error means a static limit was reached so we can't
		// parse the document
		g_errno = EBUFOVERFLOW;
		m_overflowed = true;
		// this is causing us...
		//char *u = "unknown";
		//if ( m_url ) u = m_url;
		log("dates: pools overflowed");
		return NULL;
		//char *xx=NULL;*xx=0;
	}
	// make a new pool
	char *pool = (char *)mmalloc ( POOLSIZE ,"datemempool" );
	// return NULL with g_errno set on error
	if ( ! pool ) return NULL;
	// add it
	m_pools [ m_numPools++ ] = pool;
	// set it up
	m_current = pool;
	m_currentEnd = pool + POOLSIZE;
	return (Date *)pool;
}

// returns NULL and sets g_errno on error
Date *Dates::addDate ( datetype_t dt, dateflags_t df,int32_t a, int32_t b, int32_t num){
	// make sure we got an acceptable range of word #'s
	if ( b <= a && b != 0 && a>=0 ) { char *xx=NULL;*xx=0; }
	// assume up to 100 Date::m_ptrs[]
	int32_t need = sizeof(Date) + 100 * 4;
	// point to the new mem
	Date *DD = getMem ( need );
	// problem?
	// g_errno should be set
	if ( ! DD ) return NULL;
	// sanity check
	if ( m_numDatePtrs>=m_maxDatePtrs || m_numTotalPtrs>=m_maxDatePtrs){
		// inc by 8k each time
		int32_t newMax = m_maxDatePtrs + 8000;
		// how much to realloc to? 8k chunks.
		// (8 bytes per slot: two parallel arrays of 4-byte Date ptrs,
		//  so this sizing assumes 4-byte pointers)
		int32_t need = newMax * 8;
		// realloc more
		char *pmem = (char *)mmalloc(need,"pmem");
		// on error g_errno should be set (ENOMEM, etc.)
		if ( ! pmem ) return NULL;
		// pointer for parsing up mem
		char *p = pmem;
		// start here
		Date **newDatePtrs = (Date **)p;
		// skip over
		p += newMax * 4;
		// then total ptrs
		Date **newTotalPtrs = (Date **)p;
		// skip over
		p += newMax* 4;
		// copy over from old arrays
		for ( int32_t i = 0 ; i < m_numDatePtrs ; i++ ) {
			// breathe
			QUICKPOLL(m_niceness);
			// and copy
			newDatePtrs[i] = m_datePtrs[i];
			// just in case to be safe
			m_datePtrs[i] = NULL;
		}
		// same for other array
		for ( int32_t i = 0 ; i < m_numTotalPtrs ; i++ ) {
			// breathe
			QUICKPOLL(m_niceness);
			// and copy
			newTotalPtrs[i] = m_totalPtrs[i];
			// just in case to be safe
			m_totalPtrs[i] = NULL;
		}
		// free old crap now
		mfree ( &m_datePtrs[0] , m_maxDatePtrs * 8, "pmem" );
		// update old ptrs
		m_datePtrs = newDatePtrs;
		m_totalPtrs = newTotalPtrs;
		// update max count
		m_maxDatePtrs = newMax;
	}
	//if ( m_numDatePtrs >= MAX_DATE_PTRS ) {char *xx=NULL;*xx=0;}
	// sanity check - must be from somewhere
	//if ( ! ( df & DF_FROM_BODY ) && ! ( df & DF_FROM_URL ) ) {
	//if ( df == 0 ) {char *xx=NULL;*xx=0; }
	// sanity check
	if ( dt == 0 ) { char *xx=NULL;*xx=0; }
	// these are not simple types
	dateflags_t ct = DT_RANGE_ANY | DT_LIST_ANY | DT_COMPOUND | DT_TELESCOPE;
	// this now too
	DD->m_arrayNum = m_numTotalPtrs;
	// keep this for setting m_section/m_hardSection in ::setPart2()
	m_totalPtrs [ m_numTotalPtrs++ ] = DD;
	// . add this to tree only if a simple type
	// . because "ranges" and "lists" take over the ptr slot of the
	//   first Date ptr in their m_ptrs[] array (inline replacement)
	if ( !
	     ( dt & ct ) )
		m_datePtrs [ m_numDatePtrs++ ] = DD;
	// inc used mem
	m_current += sizeof(Date);
	if ( dt == DT_MONTH && (num > 12 || num < 1) ) { char *xx=NULL;*xx=0; }
	if ( dt == DT_MONTH && num == m_urlMonth ) df |= DF_MATCHESURLMONTH;
	if ( dt == DT_DAYNUM && num == m_urlDay ) df |= DF_MATCHESURLDAY;
	if ( dt == DT_YEAR && num == m_urlYear ) df |= DF_MATCHESURLYEAR;
	// assume its a regular tod until we discover that it is really
	// "[after|before|until] 11pm"
	if ( dt == DT_TOD ) df |= DF_EXACT_TOD;
	// and set it
	DD->m_type = dt;
	DD->m_flags = df;
	DD->m_flags5 = 0; // i guess not enough to pass in yet
	DD->m_hasType = dt; // type accumulator
	DD->m_a = a;
	DD->m_b = b;
	DD->m_maxa = a;
	DD->m_mina = a;
	DD->m_numPtrs = 0;
	DD->m_num = num;
	DD->m_truncated = 0;
	DD->m_used = NULL;
	DD->m_tagHash = 0;
	DD->m_occNum = 0;
	DD->m_clockHash = 0;
	DD->m_tableCell = NULL;
	DD->m_maxTODSection = NULL;
	DD->m_calendarSection = NULL;
	DD->m_lastDateInCalendar= NULL;
	// used by Events.cpp only
	DD->m_usedCount = 0;
	DD->m_mostUniqueDatePtr = NULL;
	DD->m_section = NULL;
	DD->m_compoundSection = NULL;
	DD->m_hardSection = NULL;
	DD->m_subdateOf = NULL;
	DD->m_dupOf = NULL;
	DD->m_dateHash64 = 0LL;
	DD->m_numFlatPtrs = 0;
	DD->m_dates = this;
	//DD->m_sentenceId = 0;
	//DD->m_containingSection = NULL;
	/*
	// sanity check
	if ( a >= 0 ) DD->m_section = m_sections->m_sectionPtrs[a];
	else DD->m_section = NULL;
	// now set m_realSection
	Section *sa = m_sections->m_sectionPtrs[a];
	// telescope until we hit a "real" section
	for ( ; sa ; sa = sa->m_parent ) {
		// get parent
		//Section *pa = sp->p_parent;
		// skip section if exactly contained by parent
		//if ( sp->m_a == pa->m_a && sp->m_b == pa->m_b )
		//	continue;
		// these are not real
		if ( m_sections->isHardSection(sa) ) break;
	}
	DD->m_hardSection = sa;
	*/
	DD->m_month = -1;
	DD->m_dayNum = -1;
	DD->m_year = -1;
	DD->m_tod = -1;
	DD->m_dow = -1;
	//DD->m_minDow = 8;
	//DD->m_maxDow = 0;
	DD->m_minYear = 2050;
	DD->m_maxYear = 1900;
	DD->m_minTod = 30*3600;
	DD->m_maxTod = 0;
	DD->m_minDayNum = 32;
	DD->m_maxDayNum = 0;
	DD->m_timestamp = 0;
	DD->m_suppFlags = 0;
	DD->m_telescope = NULL;
	DD->m_headerCount = 0;
	DD->m_norepeatKey = 0LL;
	DD->m_dowBits = 0;
	DD->m_maxYearGuess = 0;
	DD->m_dowBasedYear = 0;
	DD->m_minStartFocus = 0;
	DD->m_maxStartFocus = 0;
	// set our m_year
	if ( dt == DT_YEAR ) DD->m_year = num;
	if ( dt == DT_MONTH ) DD->m_month = num;
	if ( dt == DT_DAYNUM ) DD->m_dayNum = num;
	if ( dt == DT_TOD ) DD->m_tod = num;
	if ( dt == DT_DOW ) DD->m_dow = num;
	// set min/max dow
	if ( dt == DT_DOW ) {
		//DD->m_minDow = num;
		//DD->m_maxDow = num;
		// turn on the dow bit
		if ( num >= 8 ) { char *xx=NULL;*xx=0; }
		DD->m_dowBits |= (1<<(num-1));
	}
	if ( dt == DT_EVERY_DAY ) {
		//DD->m_minDow = 1;
		//DD->m_maxDow = 7;
		DD->m_dowBits |= (1|2|4|8|16|32|64);
	}
	if ( dt == DT_SUBWEEK && num == HD_WEEKENDS )
		DD->m_dowBits |= (1|64);
	if ( dt == DT_SUBWEEK && num == HD_WEEKDAYS )
		DD->m_dowBits |= (2|4|8|16|32);
	if ( dt == DT_SUBWEEK && num == HD_TTH )
		DD->m_dowBits |= (4|16);
	if ( dt == DT_SUBWEEK && num == HD_MW )
		DD->m_dowBits |= (2|8);
	if ( dt == DT_SUBWEEK && num == HD_MWF )
		DD->m_dowBits |= (2|8|32);
	if ( dt == DT_YEAR ) {
		DD->m_minYear = num;
		DD->m_maxYear = num;
	}
	if ( dt == DT_DAYNUM ) {
		DD->m_minDayNum = num;
		DD->m_maxDayNum = num;
	}
	if ( dt == DT_TOD ) {
		DD->m_minTod = num;
		DD->m_maxTod = num;
	}
	if ( dt == DT_TIMESTAMP )
		DD->m_timestamp = num;
	// a special hack for timestamps. we always expect event dates
	// to have m_ptrs set
	//if ( dt == DT_TIMESTAMP ) {
	//	DD->m_numPtrs = 1;
	//	DD->m_ptrs[0] = DD;
	//}
	// sanity check. do not allow anyone to use 0!
	if ( num == 0 && !
	     ( dt & ct ) && dt != DT_TOD ) {//&&dt != DT_MOD ) {
		char *xx=NULL;*xx=0; }
	// set DD->m_tagHash if we should
	if ( a < 0 ) return DD;
	// get section
	//Section *sp = m_sections->m_sectionPtrs[DD->m_a];
	// shortcut
	//DD->m_tagHash = sp->m_tagHash;
	// indicate if we are a regular holiday like thanksgiving
	//if ( DD->m_type == DT_HOLIDAY &&
	//     DD->m_num <= HD_SPECIFIC_HOLIDAY_MAX )
	//	DD->m_suppFlags |= SF_NORMAL_HOLIDAY;
	// ok!
	return DD;
}

int32_t Dates::getDateNum ( Date *di ) {
	// what date # are we?
	for ( int32_t i = 0 ; i < m_numDatePtrs ; i++ ) {
		// breathe
		QUICKPOLL(m_niceness);
		if ( m_datePtrs[i] == di ) return i;
	}
	return -1;
}

//#define UNBOUNDED (-1)
//#define MAX_TOD (24*3600-1)

//static char s_numDaysInMonth[] = { 31,28,31, 30,31,30, 31,31,30, 31,30,31 };

void Date::addPtr ( Date *ptr , int32_t i , class Dates *parent ) {
	// sanity check - do not overflow
	if ( m_numPtrs >= 100 ) { char *xx=NULL;*xx=0; }
	// get his index
	if ( parent->m_datePtrs[i] != ptr ) { char *xx=NULL;*xx=0; }
	// avoid "Friday [[]] Friday"
	//if ( ptr->m_type == DT_DOW &&
	//     m_numPtrs == 1 &&
	//     m_ptrs[0]->m_type == DT_DOW &&
	//     ptr->m_minDOW == m_ptrs[0]->m_minDOW &&
	//     ptr->m_maxDOW == m_ptrs[0]->m_maxDOW )
	//	return;
	// . nuke him
	// . NOT if he is a telescope parent though!
	if ( m_type != DT_TELESCOPE ) {
		// nuke him
		parent->m_datePtrs[i] = NULL;
		// we may replace
		if ( m_numPtrs == 0 ) parent->m_datePtrs[i] = this;
	}
	// preserve all original dates and create a new Date for telescoping w/
	else if ( m_numPtrs == 0 ) {
		// sanity -- shouldn't we call addDate() to realloc?
		if ( parent->m_numDatePtrs >= parent->m_maxDatePtrs ) {
			char *xx=NULL;*xx=0; }
		parent->m_datePtrs[parent->m_numDatePtrs] = this;
		parent->m_numDatePtrs++;
	}
	// sanity check
	//if ( m_numPtrs == 0 && m_type ) { char *xx=NULL;*xx=0; }
	// sanity check - must be one of these in order to add ptrs
	//if ( !(m_flags&(DF_RANGE|DF_LIST|DF_COMPOUND|DF_TELESCOPE))) {
	//	char *xx=NULL;*xx=0;}
	// sanity check - type must be consistent in lists and ranges
	//if (!(m_flags & (DF_COMPOUND|DF_TELESCOPE))&&
	//    m_numPtrs>=1&&ptr->m_type!= m_type ) {
	//	char *xx=NULL;*xx=0; }
	// update word range to be all inclusive for now
	if ( m_numPtrs == 0 ) {
		m_a = ptr->m_a;
		m_b = ptr->m_b;
	}
	else if ( m_type != DT_TELESCOPE ) {
		if ( ptr->m_a < m_a ) m_a = ptr->m_a;
		if ( ptr->m_b > m_b ) m_b = ptr->m_b;
	}
	if ( ptr->m_a > m_maxa ) m_maxa = ptr->m_a;
	if ( ptr->m_a < m_mina ) m_mina = ptr->m_a;
	// get crazy stuff out
	//if ( m_b - m_a > 50 ) { char *xx=NULL;*xx=0; }
	// ptr hash
	if ( m_numPtrs == 0 ) m_ptrHash = (uint32_t)(PTRTYPE)ptr;
	else {
		m_ptrHash *= 439523;
		m_ptrHash ^= (uint32_t)(PTRTYPE)ptr;
		if ( m_ptrHash == 0 ) m_ptrHash = 1234567;
	}
	// integrate him into our array
	m_ptrs [ m_numPtrs++ ] = ptr;
	// inc used mem
	parent->m_current += 4;
	// he ptrs to us
	//ptr->m_dateParent = this;
	//bool inherit = true;
	//if ( ptr->m_flags & DF_CLOSE_DATE ) inherit = false;
	// . integrate flag in case compound, DF_COMPOUND
	// . only accumulate if not a closed hours. this fixes
	//   collectorsguide.com which originally had
	//   "Every Sunday before 1pm [[]] Tue-Sun 9-5" but when we added
	//   the closed hours algo it just got
	//   "Every Sunday before 1pm [[]] holidays" which stopped it
	//   from getting "Tue-Sun 9-5" in isCompatible()
	//if ( inherit ) m_hasType |= ptr->m_hasType;
	// . no, now i am keeping close dates completely separate from
	//   non-close dates as far as telescoping, etc. goes. they are not
	//   allowed to mix dates.
	m_hasType |= ptr->m_hasType;
	// get the ptr's flags
	datetype_t psflags = ptr->m_suppFlags;
	// do not inherit the SF_NON flag
	psflags &= ~SF_NON;
	// indicate if we are a regular holiday like thanksgiving
	//if ( ptr->m_type == DT_HOLIDAY &&
	//     ptr->m_num <= HD_SPECIFIC_HOLIDAY_MAX )
	//	psflags |= SF_NORMAL_HOLIDAY;
	// inherit the "suppFlags" so our DF_BAD_RECURRING_DOW algo works!
	//if ( inherit ) m_suppFlags |= ptr->m_suppFlags;
	m_suppFlags |= psflags;//ptr->m_suppFlags;
	// get flags
	dateflags_t flags = ptr->m_flags;
	// take out the DF_STORE_HOURS
	flags &= ~DF_STORE_HOURS;
	// like store hours
	flags &= ~DF_SCHEDULECAND;
	flags &= ~DF_WEEKLY_SCHEDULE;
	// and other page
	flags &= ~DF_ONOTHERPAGE;
	// is ptr a daynum?
	if ( ptr->m_hasType & DT_DAYNUM ) flags |= DF_HAS_ISOLATED_DAYNUM;
	// if we are adding a range ptr then the DF_HAS_ISOLATED_DAYNUM
	// flag should be stopped at that point, if it even exists
	if ( ptr->m_hasType & DT_RANGE_ANY ) flags &= ~DF_HAS_ISOLATED_DAYNUM;
	if ( m_type & DT_RANGE_ANY ) flags &= ~DF_HAS_ISOLATED_DAYNUM;
	if ( ptr->m_hasType & DT_LIST_ANY ) flags &= ~DF_HAS_ISOLATED_DAYNUM;
	if ( m_type & DT_LIST_ANY ) flags &= ~DF_HAS_ISOLATED_DAYNUM;
	m_flags |= flags;//ptr->m_flags;
	// propagate the new flags bits as well
	m_flags5 |= ptr->m_flags5;
	/*
	// set m_tagHash if we should
	if ( m_a >= 0 ) { // m_type && ( m_flags & DF_FROM_BODY ) ) {
		// get section
		Section *ss = parent->m_sections->m_sectionPtrs[m_a];
		// shortcut
		m_tagHash = ss->m_tagHash;
		// panic - no, parent section has no taghash
		//if ( m_tagHash == 0 || m_tagHash ==-1) {char *xx=NULL;*xx=0;}
	}
	*/
	// inherit section and taghash and hardsec from first ptr that has it
	if ( ! m_section && ptr->m_section ) {
		m_section = ptr->m_section;
		m_hardSection = ptr->m_hardSection;
		m_tagHash = ptr->m_section->m_tagHash;
		m_turkTagHash= ptr->m_section->m_turkTagHash32;
		if ( ! m_section ) { char *xx=NULL;*xx=0; }
		// no!
		// i've seen a text only doc that actually has NO hard
		// sections, so let NULL imply that the hard section is the
		// root section...
		//if ( ! m_hardSection ) { char *xx=NULL;*xx=0; }
		if ( ! m_tagHash ) { char *xx=NULL;*xx=0; }
		if ( ! m_turkTagHash ) { char *xx=NULL;*xx=0; }
	}
	if ( ! (m_flags & DF_FROM_BODY) && m_a >= 0 ) {char *xx=NULL;*xx=0;}
	// first ptr sets DF_STORE_HOURS
	if ( m_numPtrs == 1 && ( ptr->m_flags & DF_STORE_HOURS ) )
		m_flags |= DF_STORE_HOURS;
	// if any thereafter is off, we are off then
	if ( ! ( ptr->m_flags & DF_STORE_HOURS ) )
		m_flags &= ~DF_STORE_HOURS;
	// first ptr sets DF_ONOTHERPAGE
	if ( m_numPtrs == 1 && ( ptr->m_flags & DF_ONOTHERPAGE ) )
		m_flags |= DF_ONOTHERPAGE;
	// if both do not have DF_ONOTHERPAGE set, then clear it
	if ( ! ( ptr->m_flags & DF_ONOTHERPAGE ) )
		m_flags &= ~DF_ONOTHERPAGE;
	// nor store hours, BUT since we telescope after setting the
	// DF_STORE_HOURS bit we will have to set it again or set it
	// after telescoping
	//m_flags &= ~DF_STORE_HOURS;
	// add it back in if we are...
	//if ( m_hasType==(DT_DOW|DT_TOD|DT_RANGE_TOD|DT_TELESCOPE) )
	//	m_flags |= DF_STORE_HOURS;
	//if(m_hasType==(DT_DOW|DT_TOD|DT_RANGE_TOD|DT_RANGE_DOW|DT_TELESCOPE))
	//	m_flags |= DF_STORE_HOURS;
	// see if this guy has an ongoing indicator before him
	//bool ongoing = (m_flags & DF_ONGOING);
	//if ( ptr->m_flags & DF_HAS_YEAR )
	//	m_flags |= DF_HAS_YEAR;
	bool invalid = false;
	// collision? if disagreement set DF_INVALID
	if ( ptr->m_month>=1 && m_month!=-1 && ptr->m_month != m_month ) {
		m_month = -2;
		invalid = true;
	}
	else if ( ptr->m_month >= 1 && m_month == -1 )
		m_month = ptr->m_month;
	if ( ptr->m_dayNum>=1 && m_dayNum>=1 && ptr->m_dayNum != m_dayNum ) {
		m_dayNum = -2;
		invalid = true;
	}
	else if ( ptr->m_dayNum >= 1 && m_dayNum == -1 )
		m_dayNum = ptr->m_dayNum;
	// set it to -2 so it can't be reset by adding another DT_DOW ptr!
	if ( ptr->m_dow!=-1 && m_dow>=0 && ptr->m_dow != m_dow ) {
		m_dow = -2;
		invalid = true;
	}
	else if ( ptr->m_dow != -1 && m_dow == -1 )
		m_dow = ptr->m_dow;
	// NOTE: unlike the fields above, a year collision sets -1 here (not
	// -2), so the second year check further down re-sets m_year from
	// this ptr
	if ( ptr->m_year >= 1 && m_year>=1 && ptr->m_year != m_year ) {
		m_year = -1;
		invalid = true;
	}
	else if ( ptr->m_year >=1 && m_year == -1 )
		m_year = ptr->m_year;
	// if we already got dow bits, intersect for telescope pieces
	if ( m_type == DT_TELESCOPE && m_dowBits && ptr->m_dowBits ) {
		// nuke him
		m_dowBits &= ptr->m_dowBits;
	}
	// otherwise, accumulate dow bits
	else {
		m_dowBits |= ptr->m_dowBits;
	}
	// set min/max dow
	//if ( ptr->m_dow >= 0 ) {
	//	if ( ptr->m_dow < m_minDow ) m_minDow = ptr->m_dow;
	//	if ( ptr->m_dow > m_maxDow ) m_maxDow = ptr->m_dow;
	//}
	if ( ptr->m_year != -1 ) {
		if ( ptr->m_year < m_minYear ) m_minYear = ptr->m_year;
		if ( ptr->m_year > m_maxYear ) m_maxYear = ptr->m_year;
	}
	if ( ptr->m_dayNum != -1 ) {
		if ( ptr->m_dayNum < m_minDayNum ) m_minDayNum = ptr->m_dayNum;
		if ( ptr->m_dayNum > m_maxDayNum ) m_maxDayNum = ptr->m_dayNum;
	}
	//if ( ptr->m_tod >= 0 ) { // != -1 ) {
	//	if ( ptr->m_tod < m_minTod ) m_minTod = ptr->m_tod;
	//	if ( ptr->m_tod > m_maxTod ) m_maxTod = ptr->m_tod;
	//}
	if ( ptr->m_minTod < m_minTod ) m_minTod = ptr->m_minTod;
	if ( ptr->m_maxTod > m_maxTod ) m_maxTod = ptr->m_maxTod;
	if ( ptr->m_year >=1 && m_year >=1 && ptr->m_year != m_year ) {
		m_year = -2;
		invalid = true;
	}
	else if ( ptr->m_year >= 1 && m_year == -1 )
		m_year = ptr->m_year;
	if ( ptr->m_tod >= 0 && m_tod >= 0 && ptr->m_tod != m_tod ) {
		m_tod = -2;
		invalid= true;
	}
	else if ( ptr->m_tod >= 0 && m_tod == -1 )
		m_tod = ptr->m_tod;
	// inherit non-null calendar sections
	if ( ptr->m_calendarSection )
		m_calendarSection = ptr->m_calendarSection;
	// if we are a compound or telescope adding a compound ptr then
	// inherit his junk
	//if ( ptr->m_month ) m_month = ptr->m_month;
	//if ( ptr->m_dayNum ) m_dayNum = ptr->m_dayNum;
	//if ( ptr->m_year ) m_year = ptr->m_year;
	//if ( ptr->m_tod ) m_tod = ptr->m_tod;
	// return if we got a
	// range or list
	if ( m_hasType & (DT_RANGE_ANY|DT_LIST_ANY)) return;
	/*
	// collision? if disagreement set DF_INVALID
	if ( ptr->m_month && m_month && ptr->m_month != m_month )
		m_flags |= DF_INVALID;
	if ( ptr->m_dayNum && m_dayNum && ptr->m_dayNum != m_dayNum )
		m_flags |= DF_INVALID;
	if ( ptr->m_year && m_year && ptr->m_year != m_year )
		m_flags |= DF_INVALID;
	if ( ptr->m_tod && m_tod && ptr->m_tod != m_tod )
		m_flags |= DF_INVALID;
	*/
	// . set as invalid if there was a collision
	// . not now since we have strong and weak dows, and the strong dow
	//   can override the weak dow
	//if ( invalid )
	//	m_flags |= DF_INVALID;
	// if we are not adding mon/day/year/tod, then we are done
	datetype_t st = DT_MONTH | DT_DAYNUM | DT_YEAR | DT_TOD |
		DT_COMPOUND | DT_TELESCOPE ;
	if ( ! ( ptr->m_type & st ) ) return;
	// otherwise, set the appropriate member var
	//if ( ptr->m_type == DT_MONTH ) m_month = ptr->m_num;
	//if ( ptr->m_type == DT_DAYNUM ) m_dayNum = ptr->m_num;
	//if ( ptr->m_type == DT_YEAR ) m_year = ptr->m_num;
	//if ( ptr->m_type == DT_TOD ) m_tod = ptr->m_num;
	// return if we do not have at least the month/day/year
	if ( m_month <= 0 ) return;
	if ( m_year <= 0 ) return;
	if ( m_dayNum <= 0 ) return;
	// make a timestamp based on that stuff
	tm ts1;
	memset(&ts1, 0, sizeof(tm));
	ts1.tm_mon = m_month - 1;
	ts1.tm_mday = m_dayNum;
	ts1.tm_year = m_year - 1900;
	// use our time of day (tod) if set, otherwise default to 0 (midnight)
	int32_t tod = m_tod;
	if ( tod < 0 ) tod = 0;
	ts1.tm_hour = (tod / 3600);
	ts1.tm_min = (tod % 3600) / 60;
	ts1.tm_sec = (tod % 3600) % 60;
	// . make the time
	// . this is -1 for early years!
	m_timestamp = mktime(&ts1);
}

// . returns false and sets g_errno on error
// . returns true on success
// . siteHash must be saved in TitleRec and used again when deleting this
//   from indexdb, so pass in the siteHash that Msg16 uses when it calls
//   TitleRec::set(...siteHash...)
bool Dates::setPart1 ( //char *u ,
		       //char *redirUrl ,
		       Url *url ,
		       Url *redirUrl ,
		       uint8_t ctype , // contenttype,like gif,jpeg,html
		       int32_t ip , // ip of url "u"
		       int64_t docId ,
		       int32_t siteHash ,
		       Xml *xml ,
		       Words *words ,
		       Bits *bits ,
		       Sections *sections ,
		       LinkInfo *info1 ,
		       //Dates *odp , // old dates from old title rec
		       HashTableX *cct , // cct replaces odp
		       XmlDoc *nd , // new XmlDoc (this)
		       XmlDoc *od , // old XmlDoc
		       char *coll ,
		       int32_t niceness ) {
	//reset();
	// if empty, set to NULL
	if ( cct && cct->getNumSlotsUsed() == 0 ) cct = NULL;
	// must have been called
	//if ( ! m_calledParseDates ) { char *xx=NULL;*xx=0; }
	// save
	m_coll = coll;
	m_url = url;
	m_redirUrl = redirUrl;
	m_od = od;
	// save this
	m_siteHash = siteHash;
	m_niceness = niceness;
	m_bits = bits;
	//if ( bits ) m_bits = bits->m_bits;
	g_dp2 = m_datePtrs;
	m_contentType = ctype;
	// sanity. parseDates() should have set this when XmlDoc
	// called it explicitly before calling setPart1().
	// well now it no longer needs to call it explicitly since
	// xmldoc calls getAddresses() before setting the implied
	// sections. and getAddresses() calls getSimpleDates() which calls
	// this function, setPart1() which will call parseDates() below.
	//if ( m_nw != words->m_numWords ) { char *xx=NULL; *xx=0; }
	// . get the current time in utc
	// . NO! to ensure the "qatest123" collection re-injects docs exactly
	//   the same, use the spideredTime from the doc
	// . we make sure to save this in the test subdir somehow..
	//m_now = nd->m_spideredTime; // getTimeSynced();
	m_sections = sections;
	m_words = words;
	m_wptrs = words->getWords();
	m_wlens = words->getWordLens();
	m_wids = words->m_wordIds;
	m_tids = words->m_tagIds;
	m_nw = words->m_numWords;
	m_docId = docId;
	// for getting m_spideredTime
	m_nd = nd;
	// shortcut
	//Sections *ss = sections;
	// parse up spidered time if we are an open-ended range from here on
	time_t ts = nd->m_spideredTime;
	m_spts = localtime ( &ts );
	// . the date specified in the rss/atom feed is the best
	// . the tag from the rss
	// . loop through the Inlinks
	Inlink *k = NULL;
	for ( ; info1 && (k=info1->getNextInlink(k)) ; ) {
		// breathe
		QUICKPOLL(m_niceness);
		// does it have an xml item? skip if not.
		if ( k->size_rssItem <= 1 ) continue;
		// check xml for pub date
		log(LOG_DEBUG,"date: getting pub date from rss");
		// make xml from it
		Xml itemXml;
		if ( ! k->setXmlFromRSS ( &itemXml , m_niceness ) )
			// return false on error with g_errno set
			return false;
		// . get the date tag
		// . is it rss or atom?
		int32_t dateLen;
		// false = skip leading spaces
		char *date = itemXml.getString ( "pubDate",&dateLen,false );
		// atom?
		if ( ! date ) date=itemXml.getString("created",&dateLen,false);
		// if nothing, go to next
		if ( ! date ) continue;
		// rdf? look for dc:date
		for ( int32_t i = 0; i < itemXml.m_numNodes-1; i++ ) {
			// breathe
			QUICKPOLL ( m_niceness );
			XmlNode *nn = &itemXml.m_nodes[i];
			// skip text nodes
			if ( nn->m_nodeId == 0 ) continue;
			// check the node for "dc:date"
			if ( nn->m_tagNameLen == 2 &&
			     nn->m_nodeLen > 7 &&
			     strncasecmp(nn->m_tagName,"dc:date", 7 ) == 0 ) {
				date = itemXml.m_nodes[i+1].m_node;
				dateLen = itemXml.m_nodes[i+1].m_nodeLen;
				break;
			}
		}
		// if not there, skip
		if ( ! date ) continue;
		// get words
		Words ww;
		if ( ! ww.set ( date ,
				dateLen ,
				TITLEREC_CURRENT_VERSION ,
				true , // compute Ids?
				m_niceness ))
			// return false with g_errno set on error
			return false;
		// determine flag
		dateflags_t defFlags = DF_FROM_RSSINLINK;
		// is it local?
		if ( k->m_ip == ip ) defFlags |= DF_FROM_RSSINLINKLOCAL;
		// . now parse up just those words
		// . returns false and sets g_errno on error
		// . set default flag to indicate from an rss inlink
		if (!parseDates(&ww,defFlags,NULL,NULL,niceness,NULL,CT_HTML))
			return false;
	}
	// . get date from the "datenum" meta tag
	// . this allows clients to use datedb for arbitrary numbers
	// . for search for meta date
	// . sometimes "xml" is NULL if just parsing the url
	if ( xml ) {
		int32_t metaDate = -1;
		char buf [ 32 ];
		int32_t bufLen ;
		// do they have this meta tag?
		bufLen = xml->getMetaContent ( buf,32,"datenum",7);
		// should be in seconds since the epoch
		if ( bufLen > 0 ) metaDate = atoi ( buf );
		// shortcut
		dateflags_t df = DF_FROM_META;
		// . add that now too
		// . this returns false and sets g_errno on error
		if ( metaDate>0 && ! addDate(DT_TIMESTAMP,df,-1,0,metaDate))
			return false;
	}
	// shortcut
	char *u = "";
	if ( m_url ) u = m_url->getUrl();
	// . returns false and sets g_errno on error
	// . sets m_dateFromUrl
	// . sets it to -1 if none
	m_urlYear = 0;
	m_urlMonth = 0;
	m_urlDay = 0;
	int32_t urlTimeStamp=parseDateFromUrl(u,&m_urlYear,&m_urlMonth,&m_urlDay);
	// add the url date if we had one
	if ( urlTimeStamp && m_urlDay && m_urlMonth && m_urlYear ) {
		// shortcut
		dateflags_t df = DF_FROM_URL ; // | DF_NOTIMEOFDAY;
		// use noon for time of day, which is 17:00 UTC
		//int32_t tod = (12 + 5) * 3600;
		// make 3 simple dates
		int32_t ni = m_numDatePtrs;
		int32_t nj = m_numDatePtrs+1;
		int32_t nk = m_numDatePtrs+2;
		Date *di = addDate (DT_DAYNUM,df,-1,-1,m_urlDay);
		if ( ! di ) return false;
		Date *dj = addDate (DT_MONTH ,df,-1,-1,m_urlMonth);
		if ( ! dj ) return false;
		Date *dk = addDate (DT_YEAR ,df,-1,-1,m_urlYear);
		if ( ! dk ) return false;
		// make a compound date
		Date *DD = addDate ( DT_COMPOUND,0,-1,-1,0);
		if ( ! DD ) return false;
		// and add those 3 simple dates to "DD"
		DD->addPtr ( di , ni , this );
		DD->addPtr ( dj , nj , this );
		DD->addPtr ( dk , nk , this );
		//if (!addDate(dt,df,-1,0,urlDay,urlMonth,urlYear))
		//	return false;
	}
	// . now get the dates from the body of the doc
	// . returns false and sets g_errno on error
	// . make sure "url" is non-null otherwise we are probably in a set2()
	//   call and we do not want to do an infinite recurrence loop
	//if ( words && u && ! set2 ( words , m_niceness ) ) return false;
	if ( words && u && !
	     parseDates (words,DF_FROM_BODY,bits,sections,
			 niceness,m_url,m_contentType))
		return false;
	// . call this a final time and link dates in the same sentence
	// . linkDatesInSameSentence = true
	// . fixes santafe.org which has
	//   "The Saturday market is open from 10 a.m.-3 p.m" and we need those
	//   to be linked together in a compound.
	if ( ! makeCompounds ( words ,
			       false , // monthDayOnly?
			       true , // linkDatesInSameSentence?
			       false ) ) // ignoreBreakingTags
		return false;
	// try without it as well...
	//if ( ! makeCompounds ( words , false , false ) ) return false;
	// if nothing, return now
	if ( m_numDatePtrs <= 0 ) return true;
	// sanity check - must be set from parseDates()
	if ( h_open == 0 ) { char *xx=NULL;*xx=0; }
	//
	// now since we no longer set Date::m_section and m_hardSection
	// in addDate() and addPtr() we have to make up for it here. we are
	// no longer allowed to use the Sections class in Dates::parseDates()
	// because Sections::set() calls parseDates() because it uses the dates
	// to set implied sections that consist of a dom/dow header and tod
	// subjects. i did hack Date::addPtr() to inherit the m_hardSection,
	// m_section and m_tagHash from the first ptr though so that the
	// telescoping code below here will set those things.
	//
	// CONSIDER moving this to setpart2 if implied sections are hard
	// sections. because the sections class does not have implied sections
	// inserted at this point.
	//
	for ( int32_t i = 0 ; i < m_numTotalPtrs ; i++ ) {
		// breathe
		QUICKPOLL ( m_niceness );
		// shortcut
		//Date *di = m_datePtrs[i];
		Date *di = m_totalPtrs[i];
		// skip if none
		//if ( !
		//     di ) continue;
		// get this
		int32_t a = di->m_a;
		// skip if in url
		if ( a < 0 ) continue;
		// get section date is in, if any
		Section *sa = m_sections->m_sectionPtrs[a];
		// sanity check
		di->m_section = sa;
		// set tag hash
		di->m_tagHash = sa->m_tagHash;
		di->m_turkTagHash = sa->m_turkTagHash32;
		// telescope until we hit a "real" section
		for ( ; sa ; sa = sa->m_parent ) {
			// get parent
			//Section *pa = sp->p_parent;
			// skip section if exactly contained by parent
			//if ( sp->m_a == pa->m_a && sp->m_b == pa->m_b )
			//	continue;
			// these are not real
			if ( m_sections->isHardSection(sa) ) break;
		}
		di->m_hardSection = sa;
	}
	//
	// . kill dates in bad sections
	// . had to move this from parseDates() since it is called
	//   from Sections::set() now and can not use the Sections class
	// . the div style display none tags are SEC_HIDDEN now
	sec_t badFlags =SEC_MARQUEE|SEC_STYLE|SEC_SCRIPT|SEC_SELECT|
		SEC_HIDDEN|SEC_NOSCRIPT;
	for ( int32_t i = 0 ; i < m_numDatePtrs ; i++ ) {
		// breathe
		QUICKPOLL ( m_niceness );
		// shortcut
		Date *di = m_datePtrs[i];
		// skip if none
		if ( ! di ) continue;
		// get section
		Section *sd = di->m_section;
		// skip if none
		if ( ! sd ) continue;
		// skip if not bad
		if ( ! ( sd->m_flags & badFlags ) ) continue;
		// kill it otherwise, like if in